HUMAN GENES AND GENE EXPRESSION PRODUCTS V
Field of the Invention 

The present invention relates to polynucleotides of human origin and the encoded gene 
products. 

Background of the Invention 
Identification of novel polynucleotides, particularly those that encode an expressed gene 
product, is important in the advancement of drug discovery, diagnostic technologies, and the 
understanding of the progression and nature of complex diseases such as cancer. Identification of 
genes expressed in different cell types isolated from sources that differ in disease state or stage, 
developmental stage, exposure to various environmental factors, the tissue of origin, the species
from which the tissue was isolated, and the like is key to identifying the genetic factors that are
responsible for the phenotypes associated with these various differences. 

This invention provides novel human polynucleotides, the polypeptides encoded by these 
polynucleotides, and the genes and proteins corresponding to these novel polynucleotides. 

Summary of the Invention 
This invention relates to novel human polynucleotides and variants thereof, their encoded 
polypeptides and variants thereof, to genes corresponding to these polynucleotides and to proteins 
expressed by the genes. The invention also relates to diagnostics and therapeutics comprising such 
novel human polynucleotides, their corresponding genes or gene products, including probes, 
antisense nucleotides, and antibodies. The polynucleotides of the invention correspond to a 
polynucleotide comprising the sequence information of at least one of SEQ ID NOS: 1-2707. 

Various aspects and embodiments of the invention will be readily apparent to the ordinarily 
skilled artisan upon reading the description provided herein. 

Detailed Description of the Invention 
The invention relates to polynucleotides comprising the disclosed nucleotide sequences, to 
full length cDNA, mRNA, genomic sequences, and genes corresponding to these sequences and
degenerate variants thereof, and to polypeptides encoded by the polynucleotides of the invention and
polypeptide variants. The following detailed description describes the polynucleotide compositions
encompassed by the invention, methods for obtaining cDNA or genomic DNA encoding a full-length
gene product, expression of these polynucleotides and genes, identification of structural
motifs of the polynucleotides and genes, identification of the function of a gene product encoded by
a gene corresponding to a polynucleotide of the invention, use of the provided polynucleotides as
probes and in mapping and in tissue profiling, use of the corresponding polypeptides and other gene
products to raise antibodies, and use of the polynucleotides and the encoded gene products for
therapeutic and diagnostic purposes.

Polynucleotide Compositions
The scope of the invention with respect to polynucleotide compositions includes, but is not
necessarily limited to, polynucleotides having a sequence set forth in any one of SEQ ID NOS: 1-2707;
polynucleotides obtained from the biological materials described herein or other biological
sources (particularly human sources) by hybridization under stringent conditions (particularly
conditions of high stringency); genes corresponding to the provided polynucleotides; variants of the
provided polynucleotides and their corresponding genes, particularly those variants that retain a
biological activity of the encoded gene product (e.g., a biological activity ascribed to a gene product
corresponding to the provided polynucleotides as a result of the assignment of the gene product to a
protein family(ies) and/or identification of a functional domain present in the gene product). Other
nucleic acid compositions contemplated by and within the scope of the present invention will be
readily apparent to one of ordinary skill in the art when provided with the disclosure here.
"Polynucleotide" and "nucleic acid" as used herein with reference to nucleic acids of the
composition is not intended to be limiting as to the length or structure of the nucleic acid unless
specifically indicated.

The invention features polynucleotides that are expressed in human tissue, specifically
human colon, breast, and/or lung tissue. Novel nucleic acid compositions of the invention of
particular interest comprise a sequence set forth in any one of SEQ ID NOS: 1-2707 or an
identifying sequence thereof. An "identifying sequence" is a contiguous sequence of residues at
least about 10 nt to about 20 nt in length, usually at least about 50 nt to about 100 nt in length, that
uniquely identifies a polynucleotide sequence, e.g., exhibits less than 90%, usually less than about
80% to about 85% sequence identity to any contiguous nucleotide sequence of more than about
20 nt. Thus, the subject novel nucleic acid compositions include full length cDNAs or mRNAs that
encompass an identifying sequence of contiguous nucleotides from any one of SEQ ID NOS: 1-2707.

The polynucleotides of the invention also include polynucleotides having sequence
similarity or sequence identity. Nucleic acids having sequence similarity are detected by
hybridization under low stringency conditions, for example, at 50°C and 10XSSC (0.9 M saline/0.09
M sodium citrate) and remain bound when subjected to washing at 55°C in 1XSSC. Sequence
identity can be determined by hybridization under stringent conditions, for example, at 50°C or
higher and 0.1XSSC (9 mM saline/0.9 mM sodium citrate). Hybridization methods and conditions
are well known in the art, see, e.g., USPN 5,707,829. Nucleic acids that are substantially identical to
the provided polynucleotide sequences, e.g., allelic variants, genetically altered versions of the gene,
etc., bind to the provided polynucleotide sequences (SEQ ID NOS: 1-2707) under stringent
hybridization conditions. By using probes, particularly labeled probes of DNA sequences, one can
isolate homologous or related genes. The source of homologous genes can be any species, e.g.,
primate species, particularly human; rodents, such as rats and mice; canines, felines, bovines,
ovines, equines, yeast, nematodes, etc.

Preferably, hybridization is performed using at least 15 contiguous nucleotides (nt) of at
least one of SEQ ID NOS: 1-2707. That is, when at least 15 contiguous nt of one of the disclosed
SEQ ID NOS. is used as a probe, the probe will preferentially hybridize with a nucleic acid
comprising the complementary sequence, allowing the identification and retrieval of the nucleic
acids that uniquely hybridize to the selected probe. Probes from more than one SEQ ID NO. can
hybridize with the same nucleic acid if the cDNA from which they were derived corresponds to one
mRNA. Probes of more than 15 nt can be used, e.g., probes of from about 18 nt to about 100 nt, but
15 nt represents sufficient sequence for unique identification.
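
By way of illustration only, the complementarity relationship between a probe and its target can
be sketched computationally. The following Python sketch locates exact matches of a probe, or of
its reverse complement, within a target sequence. It models sequence matching only, not
hybridization chemistry; the function names and the example sequences are hypothetical and are not
taken from the Sequence Listing. The constant at the end simply records that there are 4^15
distinct 15 nt sequences.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def find_probe_sites(probe: str, target: str) -> list[tuple[int, str]]:
    """Find exact occurrences of a probe on either strand of a target.

    Returns (0-based start in the given target string, strand) pairs.
    Strand "+" means the probe sequence itself occurs in the target;
    strand "-" means the reverse complement of the probe occurs there.
    """
    probe, target = probe.upper(), target.upper()
    sites = []
    for strand, query in (("+", probe), ("-", reverse_complement(probe))):
        start = target.find(query)
        while start != -1:
            sites.append((start, strand))
            start = target.find(query, start + 1)
    return sorted(sites)


# Number of distinct sequences of length 15 over a 4-letter alphabet.
DISTINCT_15MERS = 4 ** 15

if __name__ == "__main__":
    example_probe = "ATGGCCATTGTAATG"  # hypothetical 15 nt probe
    example_target = "CC" + reverse_complement(example_probe) + "GG"
    print(find_probe_sites(example_probe, example_target))
```

A practical probe search would also tolerate mismatches; exact matching is used here only to keep
the sketch short.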

The polynucleotides of the invention also include naturally occurring variants of the 
nucleotide sequences (e.g., degenerate variants, allelic variants, etc.). Variants of the
polynucleotides of the invention are identified by hybridization of putative variants with nucleotide 
sequences disclosed herein, preferably by hybridization under stringent conditions. For example, by 
using appropriate wash conditions, variants of the polynucleotides of the invention can be identified 
where the allelic variant exhibits at most about 25-30% base pair (bp) mismatches relative to the 
selected polynucleotide probe. In general, allelic variants contain 15-25% bp mismatches, and can
contain as little as even 5-15%, or 2-5%, or 1-2% bp mismatches, as well as a single bp mismatch.
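
As a minimal sketch of the mismatch percentages discussed above, the following Python functions
compute the percentage of mismatched positions between a probe and an equal-length candidate
sequence, and compare it to a threshold. The function names and the default threshold of 30% are
illustrative assumptions; the sketch presumes the two sequences are already aligned without gaps.

```python
def percent_mismatch(probe: str, candidate: str) -> float:
    """Percentage of positions at which two equal-length sequences differ."""
    if not probe or len(probe) != len(candidate):
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(probe.upper(), candidate.upper()))
    return 100.0 * mismatches / len(probe)


def within_mismatch_threshold(probe: str, candidate: str,
                              max_percent: float = 30.0) -> bool:
    """True if the candidate differs from the probe by at most max_percent."""
    return percent_mismatch(probe, candidate) <= max_percent
```

For example, a 10 nt candidate that differs from the probe at one position has 10% mismatches and
would fall within a 30% threshold.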

The invention also encompasses homologs corresponding to the polynucleotides of SEQ ID
NOS: 1-2707, where the source of homologous genes can be any mammalian species, e.g., primate
species, particularly human; rodents, such as rats; canines, felines, bovines, ovines, equines, yeast,
nematodes, etc. Between mammalian species, e.g., human and mouse, homologs generally have
substantial sequence similarity, e.g., at least 75% sequence identity, usually at least 90%, more
usually at least 95% between nucleotide sequences. Sequence similarity is calculated based on a
reference sequence, which may be a subset of a larger sequence, such as a conserved motif, coding
region, flanking region, etc. A reference sequence will usually be at least about 18 contiguous nt
long, more usually at least about 50 nt long, and may extend to the complete sequence that is being
compared. Algorithms for sequence analysis are known in the art, such as gapped BLAST,
described in Altschul et al., Nucleic Acids Res. (1997) 25:3389-3402.

In general, variants of the invention have a sequence identity greater than at least about
65%, preferably at least about 75%, more preferably at least about 85%, and can be greater than at
least about 90% or more as determined by the Smith-Waterman homology search algorithm as
implemented in MPSRCH program (Oxford Molecular). For the purposes of this invention, a
preferred method of calculating percent identity is the Smith-Waterman algorithm, using the
following. Global DNA sequence identity must be greater than 65% as determined by the
Smith-Waterman homology search algorithm as implemented in MPSRCH program (Oxford Molecular)
using an affine gap search with the following search parameters: gap open penalty, 12; and gap
extension penalty, 1.
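
For illustration, a Smith-Waterman local alignment score with affine gap penalties can be sketched
in Python as follows. This is not the MPSRCH implementation. The match and mismatch scores (+5 and
-4) are assumptions, since no substitution scores are stated above, and the sketch assumes that a
gap of length k costs gap_open + (k - 1) * gap_extend, which is one of several conventions in use.
The sketch returns the best local alignment score only; computing percent identity would
additionally require a traceback, which is omitted for brevity.

```python
def smith_waterman_affine(a: str, b: str, match: int = 5, mismatch: int = -4,
                          gap_open: int = 12, gap_extend: int = 1) -> int:
    """Best local alignment score of a and b under affine gap penalties.

    Assumed convention: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    a, b = a.upper(), b.upper()
    neg_inf = float("-inf")
    best = 0
    prev_h = [0] * (len(b) + 1)        # best scores ending at the previous row
    prev_f = [neg_inf] * (len(b) + 1)  # scores ending in a gap that skips bases of a
    for i in range(1, len(a) + 1):
        cur_h = [0] * (len(b) + 1)
        cur_f = [neg_inf] * (len(b) + 1)
        e = neg_inf                    # score ending in a gap that skips bases of b
        for j in range(1, len(b) + 1):
            # Either open a new gap or extend the current one.
            e = max(cur_h[j - 1] - gap_open, e - gap_extend)
            cur_f[j] = max(prev_h[j] - gap_open, prev_f[j] - gap_extend)
            diag = prev_h[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best
```

Under these assumptions, aligning ten A residues followed by ten C residues against the same
sequence with two extra G residues inserted in the middle gives 20 matches and one gap of length 2.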

The subject nucleic acids can be cDNAs or genomic DNAs, as well as fragments thereof,
particularly fragments that encode a biologically active gene product and/or are useful in the
methods disclosed herein (e.g., in diagnosis, as a unique identifier of a differentially expressed gene
of interest, etc.). The term "cDNA" as used herein is intended to include all nucleic acids that share
the arrangement of sequence elements found in native mature mRNA species, where sequence
elements are exons and 3' and 5' non-coding regions. Normally mRNA species have contiguous
exons, with the intervening introns, when present, being removed by nuclear RNA splicing, to create
a continuous open reading frame encoding a polypeptide of the invention.

A genomic sequence of interest comprises the nucleic acid present between the initiation
codon and the stop codon, as defined in the listed sequences, including all of the introns that are
normally present in a native chromosome. It can further include the 3' and 5' untranslated regions
found in the mature mRNA. It can further include specific transcriptional and translational
regulatory sequences, such as promoters, enhancers, etc., including about 1 kb, but possibly more, of
flanking genomic DNA at either the 5' or 3' end of the transcribed region. The genomic DNA can
be isolated as a fragment of 100 kbp or smaller, and substantially free of flanking chromosomal
sequence. The genomic DNA flanking the coding region, either 3' or 5', or internal regulatory
sequences as sometimes found in introns, contains sequences required for proper tissue, stage-specific,
or disease-state specific expression.

The nucleic acid compositions of the subject invention can encode all or a part of the subject
polypeptides. Double or single stranded fragments can be obtained from the DNA sequence by
chemically synthesizing oligonucleotides in accordance with conventional methods, by restriction
enzyme digestion, by PCR amplification, etc. Isolated polynucleotides and polynucleotide
fragments of the invention comprise at least about 10, about 15, about 20, about 35, about 50, about
100, about 150 to about 200, about 250 to about 300, or about 350 contiguous nt selected from the
polynucleotide sequences as shown in SEQ ID NOS: 1-2707. For the most part, fragments will be of
at least 15 nt, usually at least 18 nt or 25 nt, and up to at least about 50 contiguous nt in length or
more. In a preferred embodiment, the polynucleotide molecules comprise a contiguous sequence of
at least 12 nt selected from the group consisting of the polynucleotides shown in SEQ ID NOS: 1-2707.

Probes specific to the polynucleotides of the invention can be generated using the
polynucleotide sequences disclosed in SEQ ID NOS: 1-2707. The probes are preferably at least
about a 12, 15, 16, 18, 20, 22, 24, or 25 nt fragment of a corresponding contiguous sequence of SEQ
ID NOS: 1-2707, and can be less than 2, 1, 0.5, 0.1, or 0.05 kb in length. The probes can be
synthesized chemically or can be generated from longer polynucleotides using restriction enzymes.
The probes can be labeled, for example, with a radioactive, biotinylated, or fluorescent tag.
Preferably, probes are designed based upon an identifying sequence of a polynucleotide of one of
SEQ ID NOS: 1-2707. More preferably, probes are designed based on a contiguous sequence of one
of the subject polynucleotides that remain unmasked following application of a masking program for
masking low complexity (e.g., XBLAST) to the sequence, i.e., one would select an unmasked
region, as indicated by the polynucleotides outside the poly-n stretches of the masked sequence
produced by the masking program.
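
The selection of unmasked regions can be sketched as follows. This Python example assumes, for
illustration, that the masking program replaces low-complexity residues with the letter N (the
letter X is also treated as masked), and it reports each unmasked run that is long enough to serve
as a probe. The function name, the default minimum length of 12 nt, and the example input are
hypothetical; the sketch does not call or emulate XBLAST itself.

```python
import re


def unmasked_regions(masked_seq: str, min_len: int = 12) -> list[tuple[int, int, str]]:
    """Return (start, end, subsequence) for each unmasked run of at least min_len nt.

    Coordinates are 0-based with an exclusive end. Any character other than
    A, C, G, or T (for example N or X) is treated as masked.
    """
    regions = []
    for run in re.finditer(r"[ACGT]+", masked_seq.upper()):
        if run.end() - run.start() >= min_len:
            regions.append((run.start(), run.end(), run.group()))
    return regions


if __name__ == "__main__":
    # Hypothetical masked sequence: two usable runs separated by masked stretches.
    example = "ACGTACGTACGTACG" + "N" * 8 + "ACGT" + "N" * 4 + "GGCCTTAAGGCCTT"
    for start, end, subseq in unmasked_regions(example):
        print(start, end, subseq)
```

Candidate probes would then be chosen from within the reported runs.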

The polynucleotides of the subject invention are isolated and obtained in substantial purity,
generally as other than an intact chromosome. Usually, the polynucleotides, either as DNA or RNA,
will be obtained substantially free of other naturally-occurring nucleic acid sequences, generally
being at least about 50%, usually at least about 90% pure and are typically "recombinant", e.g.,
flanked by one or more nucleotides with which it is not normally associated on a naturally occurring
chromosome.

The polynucleotides of the invention can be provided as a linear molecule or within a
circular molecule, and can be provided within autonomously replicating molecules (vectors) or
within molecules without replication sequences. Expression of the polynucleotides can be regulated
by their own or by other regulatory sequences known in the art. The polynucleotides of the
invention can be introduced into suitable host cells using a variety of techniques available in the art,
such as transferrin polycation-mediated DNA transfer, transfection with naked or encapsulated
nucleic acids, liposome-mediated DNA transfer, intracellular transportation of DNA-coated latex
beads, protoplast fusion, viral infection, electroporation, gene gun, calcium phosphate-mediated
transfection, and the like.

The subject nucleic acid compositions can be used to, for example, produce polypeptides, as
probes for the detection of mRNA of the invention in biological samples (e.g., extracts of human
cells), to generate additional copies of the polynucleotides, to generate ribozymes or antisense
oligonucleotides, and as single stranded DNA probes or as triple-strand forming oligonucleotides.
The probes described herein can be used to, for example, determine the presence or absence of the
polynucleotide sequences as shown in SEQ ID NOS: 1-2707 or variants thereof in a sample. These
and other uses are described in more detail below.

Use of Polynucleotides to Obtain Full-Length cDNA, Gene, and Promoter Region
Full-length cDNA molecules comprising the disclosed polynucleotides are obtained as
follows. A polynucleotide having a sequence of one of SEQ ID NOS: 1-2707, or a portion thereof
comprising at least 12, 15, 18, or 20 nt, is used as a hybridization probe to detect hybridizing
members of a cDNA library using probe design methods, cloning methods, and clone selection
techniques such as those described in USPN 5,654,173. Libraries of cDNA are made from selected
tissues, such as normal or tumor tissue, or from tissues of a mammal treated with, for example, a
pharmaceutical agent. Preferably, the tissue is the same as the tissue from which the
polynucleotides of the invention were isolated, as both the polynucleotides described herein and the
cDNA represent expressed genes. Most preferably, the cDNA library is made from the biological
material described herein in the Examples. The choice of cell type for library construction can be
made after the identity of the protein encoded by the gene corresponding to the polynucleotide of the
invention is known. This will indicate which tissue and cell types are likely to express the related
gene, and thus represent a suitable source for the mRNA for generating the cDNA. Where the
provided polynucleotides are isolated from cDNA libraries, the libraries are prepared from mRNA
of human colon cells, more preferably, human colon cancer cells, even more preferably, from a
highly metastatic colon cell, Km12L4-A.

Techniques for producing and probing nucleic acid sequence libraries are described, for
example, in Sambrook et al., Molecular Cloning: A Laboratory Manual 2nd Ed., (1989) Cold
Spring Harbor Press, Cold Spring Harbor, NY. The cDNA can be prepared by using primers based
on sequence from SEQ ID NOS: 1-2707. In one embodiment, the cDNA library can be made from
only poly-adenylated mRNA. Thus, poly-T primers can be used to prepare cDNA from the mRNA.

Members of the library that are larger than the provided polynucleotides, and preferably that
encompass the complete coding sequence of the native message, are obtained. In order to confirm
that the entire cDNA has been obtained, RNA protection experiments are performed as follows.
Hybridization of a full-length cDNA to an mRNA will protect the RNA from RNase degradation. If
the cDNA is not full length, then the portions of the mRNA that are not hybridized will be subject to
RNase degradation. This is assayed, as is known in the art, by changes in electrophoretic mobility
on polyacrylamide gels, or by detection of released monoribonucleotides. Sambrook et al.,
Molecular Cloning: A Laboratory Manual 2nd Ed., (1989) Cold Spring Harbor Press, Cold Spring
Harbor, NY. In order to obtain additional sequences 5' to the end of a partial cDNA, 5' RACE (PCR
Protocols: A Guide to Methods and Applications, (1990) Academic Press, Inc.) can be performed.

Genomic DNA is isolated using the provided polynucleotides in a manner similar to the
isolation of full-length cDNAs. Briefly, the provided polynucleotides, or portions thereof, are used
as probes to libraries of genomic DNA. Preferably, the library is obtained from the cell type that
was used to generate the polynucleotides of the invention, but this is not essential. Most preferably,
the genomic DNA is obtained from the biological material described herein in the Examples. Such
libraries can be in vectors suitable for carrying large segments of a genome, such as P1 or YAC, as
described in detail in Sambrook et al., 9.4-9.30. In addition, genomic sequences can be isolated
from human BAC libraries, which are commercially available from Research Genetics, Inc.,
Huntsville, Alabama, USA, for example. In order to obtain additional 5' or 3' sequences,
chromosome walking is performed, as described in Sambrook et al., such that adjacent and
overlapping fragments of genomic DNA are isolated. These are mapped and pieced together, as is
known in the art, using restriction digestion enzymes and DNA ligase.

Using the polynucleotide sequences of the invention, corresponding full-length genes can be
isolated using both classical and PCR methods to construct and probe cDNA libraries. Using either
method, Northern blots, preferably, are performed on a number of cell types to determine which cell
lines express the gene of interest at the highest level. Classical methods of constructing cDNA
libraries are taught in Sambrook et al., supra. With these methods, cDNA can be produced from
mRNA and inserted into viral or expression vectors. Typically, libraries of mRNA comprising
poly(A) tails can be produced with poly(T) primers. Similarly, cDNA libraries can be produced
using the instant sequences as primers.

PCR methods are used to amplify the members of a cDNA library that comprise the desired
insert. In this case, the desired insert will contain sequence from the full length cDNA that
corresponds to the instant polynucleotides. Such PCR methods include gene trapping and RACE
methods. Gene trapping entails inserting a member of a cDNA library into a vector. The vector
then is denatured to produce single stranded molecules. Next, a substrate-bound probe, such as a
biotinylated oligo, is used to trap cDNA inserts of interest. Biotinylated probes can be linked to an
avidin-bound solid substrate. PCR methods can be used to amplify the trapped cDNA. To trap
sequences corresponding to the full length genes, the labeled probe sequence is based on the
polynucleotide sequences of the invention. Random primers or primers specific to the library vector
can be used to amplify the trapped cDNA. Such gene trapping techniques are described in Gruber et
al., WO 95/04745 and Gruber et al., USPN 5,500,356. Kits are commercially available to perform
gene trapping experiments from, for example, Life Technologies, Gaithersburg, Maryland, USA.

"Rapid amplification of cDNA ends," or RACE, is a PCR method of amplifying cDNAs
from a number of different RNAs. The cDNAs are ligated to an oligonucleotide linker, and
amplified by PCR using two primers. One primer is based on sequence from the instant
polynucleotides, for which full length sequence is desired, and a second primer comprises sequence
that hybridizes to the oligonucleotide linker to amplify the cDNA. A description of this method is
reported in WO 97/19110. In preferred embodiments of RACE, a common primer is designed to
anneal to an arbitrary adaptor sequence ligated to cDNA ends (Apte and Siebert, Biotechniques
(1993) 15:890-893; Edwards et al., Nuc. Acids Res. (1991) 19:5227-5232). When a single
gene-specific RACE primer is paired with the common primer, preferential amplification of sequences
between the single gene specific primer and the common primer occurs. Commercial cDNA pools
modified for use in RACE are available.

Another PCR-based method generates a full-length cDNA library with anchored ends without
needing specific knowledge of the cDNA sequence. The method uses lock-docking primers (I-VI),
where one primer, poly TV (I-III) locks over the polyA tail of eukaryotic mRNA producing first
strand synthesis and a second primer, polyGH (IV-VI) locks onto the polyC tail added by terminal
deoxynucleotidyl transferase (TdT) (see, e.g., WO 96/40998).

The promoter region of a gene generally is located 5' to the initiation site for RNA
polymerase II. Hundreds of promoter regions contain the "TATA" box, a sequence such as TATTA
or TATAA, which is sensitive to mutations. The promoter region can be obtained by performing 5'
RACE using a primer from the coding region of the gene. Alternatively, the cDNA can be used as a
probe for the genomic sequence, and the region 5' to the coding region is identified by "walking
up." If the gene is highly expressed or differentially expressed, the promoter from the gene can be
of use in a regulatory construct for a heterologous gene.
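
As a simple illustration, the two TATA-like motifs named above (TATAA and TATTA) can be located
in a candidate upstream sequence with a short Python sketch. The function name and example
sequence are hypothetical, and a motif match alone is not evidence of promoter function; the
sketch only shows the pattern search.

```python
import re

# Lookahead so that overlapping occurrences are all reported.
_TATA_LIKE = re.compile(r"(?=(TAT[AT]A))")


def find_tata_like(seq: str) -> list[tuple[int, str]]:
    """Return (0-based position, motif) for each TATAA or TATTA occurrence."""
    return [(m.start(), m.group(1)) for m in _TATA_LIKE.finditer(seq.upper())]


if __name__ == "__main__":
    upstream = "GGCGTATAAAGGCTATTACC"  # hypothetical 5' flanking sequence
    print(find_tata_like(upstream))
```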

Once the full-length cDNA or gene is obtained, DNA encoding variants can be prepared by
site-directed mutagenesis, described in detail in Sambrook et al., 15.3-15.63. The choice of codon or
nucleotide to be replaced can be based on disclosure herein on optional changes in amino acids to
achieve altered protein structure and/or function.

As an alternative method to obtaining DNA or RNA from a biological material, nucleic acid
comprising nucleotides having the sequence of one or more polynucleotides of the invention can be
synthesized. Thus, the invention encompasses nucleic acid molecules ranging in length from 15 nt
(corresponding to at least 15 contiguous nt of one of SEQ ID NOS: 1-2707) up to a maximum length
suitable for one or more biological manipulations, including replication and expression, of the
nucleic acid molecule. The invention includes but is not limited to (a) nucleic acid having the size
of a full gene, and comprising at least one of SEQ ID NOS: 1-2707; (b) the nucleic acid of (a) also
comprising at least one additional gene, operably linked to permit expression of a fusion protein; (c)
an expression vector comprising (a) or (b); (d) a plasmid comprising (a) or (b); and (e) a
recombinant viral particle comprising (a) or (b). Once provided with the polynucleotides disclosed
herein, construction or preparation of (a) - (e) is well within the skill in the art.

The sequence of a nucleic acid comprising at least 15 contiguous nt of at least any one of
SEQ ID NOS: 1-2707, preferably the entire sequence of at least any one of SEQ ID NOS: 1-2707, is
not limited and can be any sequence of A, T, G, and/or C (for DNA) and A, U, G, and/or C (for
RNA) or modified bases thereof, including inosine and pseudouridine. The choice of sequence will
depend on the desired function and can be dictated by coding regions desired, the intron-like regions
desired, and the regulatory regions desired. Where the entire sequence of any one of SEQ ID
NOS: 1-2707 is within the nucleic acid, the nucleic acid obtained is referred to herein as a
polynucleotide comprising the sequence of any one of SEQ ID NOS: 1-2707.

Expression of Polypeptide Encoded by Full-Length cDNA or Full-Length Gene

The provided polynucleotides (e.g., a polynucleotide having a sequence of one of SEQ ID
NOS: 1-2707), the corresponding cDNA, or the full-length gene is used to express a partial or
complete gene product. Constructs of polynucleotides having sequences of SEQ ID NOS: 1-2707
can also be generated synthetically. Alternatively, single-step assembly of a gene and entire plasmid
from large numbers of oligodeoxyribonucleotides is described by, e.g., Stemmer et al., Gene
(Amsterdam) (1995) 164(1):49-53. In this method, assembly PCR (the synthesis of long DNA
sequences from large numbers of oligodeoxyribonucleotides (oligos)) is described. The method is
derived from DNA shuffling (Stemmer, Nature (1994) 370:389-391), and does not rely on DNA
ligase, but instead relies on DNA polymerase to build increasingly longer DNA fragments during the
assembly process.

Appropriate polynucleotide constructs are purified using standard recombinant DNA
techniques as described in, for example, Sambrook et al., Molecular Cloning: A Laboratory Manual
2nd Ed., (1989) Cold Spring Harbor Press, Cold Spring Harbor, NY, and under current regulations
described in United States Dept. of HHS, National Institute of Health (NIH) Guidelines for
Recombinant DNA Research. The gene product encoded by a polynucleotide of the invention is
expressed in any expression system, including, for example, bacterial, yeast, insect, amphibian and
mammalian systems. Vectors, host cells and methods for obtaining expression in same are well
known in the art. Suitable vectors and host cells are described in USPN 5,654,173.

Polynucleotide molecules comprising a polynucleotide sequence provided herein are
generally propagated by placing the molecule in a vector. Viral and non-viral vectors are used,
including plasmids. The choice of plasmid will depend on the type of cell in which propagation is
desired and the purpose of propagation. Certain vectors are useful for amplifying and making large
amounts of the desired DNA sequence. Other vectors are suitable for expression in cells in culture.
Still other vectors are suitable for transfer and expression in cells in a whole animal or person. The
choice of appropriate vector is well within the skill of the art. Many such vectors are available
commercially. Methods for preparation of vectors comprising a desired sequence are well known in
the art.

The polynucleotides set forth in SEQ ID NOS: 1-2707 or their corresponding full-length
polynucleotides are linked to regulatory sequences as appropriate to obtain the desired expression
properties. These can include promoters (attached either at the 5' end of the sense strand or at the 3'
end of the antisense strand), enhancers, terminators, operators, repressors, and inducers. The
promoters can be regulated or constitutive. In some situations it may be desirable to use
conditionally active promoters, such as tissue-specific or developmental stage-specific promoters.
These are linked to the desired nucleotide sequence using the techniques described above for linkage
to vectors. Any techniques known in the art can be used.

When any of the above host cells, or other appropriate host cells or organisms, are used to
replicate and/or express the polynucleotides or nucleic acids of the invention, the resulting replicated
nucleic acid, RNA, expressed protein or polypeptide, is within the scope of the invention as a
product of the host cell or organism. The product is recovered by any appropriate means known in
the art.

Once the gene corresponding to a selected polynucleotide is identified, its expression can be
regulated in the cell to which the gene is native. For example, an endogenous gene of a cell can be
regulated by an exogenous regulatory sequence as disclosed in USPN 5,641,670.

Identification of Functional and Structural Motifs of Novel Genes Screening Against Publicly 
Available Databases 

Translations of the nucleotide sequence of the provided polynucleotides, cDNAs or full
genes can be aligned with individual known sequences. Similarity with individual sequences can be
used to determine the activity of the polypeptides encoded by the polynucleotides of the invention.
Also, sequences exhibiting similarity with more than one individual sequence can exhibit activities
that are characteristic of either or both individual sequences.

The full length sequences and fragments of the polynucleotide sequences of the nearest
neighbors can be used as probes and primers to identify and isolate the full length sequence
corresponding to provided polynucleotides. The nearest neighbors can indicate a tissue or cell type
to be used to construct a library for the full-length sequences corresponding to the provided
polynucleotides.

Typically, a selected polynucleotide is translated in all six frames to determine the best
alignment with the individual sequences. The sequences disclosed herein in the Sequence Listing
are in a 5' to 3' orientation and translation in three frames can be sufficient (with a few specific
exceptions as described in the Examples). These amino acid sequences are referred to, generally, as
query sequences, which will be aligned with the individual sequences. Databases with individual
sequences are described in "Computer Methods for Macromolecular Sequence Analysis" Methods in
Enzymology (1996) 266, Doolittle, Academic Press, Inc., a division of Harcourt Brace & Co., San
Diego, California, USA. Databases include GenBank, EMBL, and DNA Database of Japan (DDBJ).
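
The six-frame translation described above can be sketched in Python as follows. The sketch uses
the standard genetic code, writes stop codons as "*", and writes any codon containing an
unrecognized base as "X". The function names and the frame labels ("+1" to "+3" for the given
strand, "-1" to "-3" for the reverse complement) are illustrative choices only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)}

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence; N is left as N."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from the first base; trailing bases are ignored."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq: str) -> dict[str, str]:
    """Translate a sequence in three forward and three reverse frames."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Where the sequences are known to be in a 5' to 3' orientation, only the "+1" to "+3" entries would
be needed as query sequences.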

Query and individual sequences can be aligned using the methods and computer programs
described above, and include BLAST 2.0, available over the world wide web at
http://www.ncbi.nlm.nih.gov/BLAST/. See also Altschul et al., Nucleic Acids Res. (1997) 25:3389-3402.
Another alignment algorithm is Fasta, available in the Genetics Computing Group (GCG)
package, Madison, Wisconsin, USA, a wholly owned subsidiary of Oxford Molecular Group, Inc.
Other techniques for alignment are described in Doolittle, supra. Preferably, an alignment program
that permits gaps in the sequence is utilized to align the sequences. The Smith-Waterman is one
type of algorithm that permits gaps in sequence alignments. See Meth. Mol. Biol. (1997) 70:173-187.
Also, the GAP program using the Needleman and Wunsch alignment method can be utilized to
align sequences. An alternative search strategy uses MPSRCH software, which runs on a MASPAR
computer. MPSRCH uses a Smith-Waterman algorithm to score sequences on a massively parallel
computer. This approach improves ability to identify sequences that are distantly related matches,
and is especially tolerant of small gaps and nucleotide sequence errors. Amino acid sequences
encoded by the provided polynucleotides can be used to search both protein and DNA databases.

1 5 Incorporated herein by reference are all sequences that have been made public as of the filing date of 
this application by any of the DNA or protein sequence databases, including the patent databases 
(e.g.. GeneSeq). Also incorporated by reference are those sequences that have been submitted to 
these databases as of the filing date of the present application but not made public until after the 
filing date of the present application. 
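As an illustration of a gapped local alignment of the Smith-Waterman type mentioned above, the following Python sketch computes only the best local alignment score. The scoring values (match +2, mismatch -1, linear gap penalty -2) are assumptions chosen for readability; the programs named in this description typically use substitution matrices and other gap models, which are not reproduced here.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

With these assumed scores, aligning "AAAGGG" against "AAACGGG" is best served by opening one gap opposite the extra C, which a gap-free comparison could not do.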

Results of individual and query sequence alignments can be divided into three categories:
high similarity, weak similarity, and no similarity. Individual alignment results ranging from high
similarity to weak similarity provide a basis for determining polypeptide activity and/or structure.
Parameters for categorizing individual results include: percentage of the alignment region length
where the strongest alignment is found, percent sequence identity, and p value. The percentage of
the alignment region length is calculated by counting the number of residues of the individual
sequence found in the region of strongest alignment, e.g., contiguous region of the individual
sequence that contains the greatest number of residues that are identical to the residues of the
corresponding region of the aligned query sequence. This number is divided by the total residue
length of the query sequence to calculate a percentage. For example, a query sequence of 20 amino
acid residues might be aligned with a 20 amino acid region of an individual sequence. The
individual sequence might be identical to amino acid residues 5, 9-15, and 17-19 of the query
sequence. The region of strongest alignment is thus the region stretching from residue 9-19, an 11
amino acid stretch. The percentage of the alignment region length is: 11 (length of the region of
strongest alignment) divided by (query sequence length) 20 or 55%.

Percent sequence identity is calculated by counting the number of amino acid matches
between the query and individual sequence and dividing total number of matches by the number of
residues of the individual sequences found in the region of strongest alignment. Thus, the percent
identity in the example above would be 10 matches divided by 11 amino acids, or approximately
90.9%.
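The worked example above can be checked with a short Python sketch. The function name and argument layout are hypothetical; the region boundaries (residues 9 to 19) are taken from the example as given rather than computed.

```python
def alignment_metrics(matched_positions, region_start, region_end, query_length):
    """Return (percent alignment region length, percent sequence identity).

    Positions are 1-based residue numbers of the query sequence.
    """
    region_length = region_end - region_start + 1
    matches_in_region = sum(
        1 for p in matched_positions if region_start <= p <= region_end
    )
    pct_region_length = 100.0 * region_length / query_length
    pct_identity = 100.0 * matches_in_region / region_length
    return pct_region_length, pct_identity


# Identical residues from the example: 5, 9-15, and 17-19 of a 20-residue query.
matched = [5] + list(range(9, 16)) + list(range(17, 20))
pct_len, pct_id = alignment_metrics(matched, 9, 19, 20)
print(f"{pct_len:.1f}% of query length, {pct_id:.1f}% identity")
```

Residue 5 lies outside the region of strongest alignment, so it does not count toward the 10 matches used for percent identity.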

P value is the probability that the alignment was produced by chance. For a single
alignment, the p value can be calculated according to Karlin et al., Proc. Natl. Acad. Sci. (1990)
87:2264 and Karlin et al., Proc. Natl. Acad. Sci. (1993) 90. The p value of multiple alignments
using the same query sequence can be calculated using a heuristic approach described in Altschul
et al., Nat. Genet. (1994) 6:119. Alignment programs such as the BLAST program can calculate the p
value. See also Altschul et al., Nucleic Acids Res. (1997) 25:3389-3402.

Another factor to consider for determining identity or similarity is the location of the
similarity or identity. Strong local alignment can indicate similarity even if the length of alignment
is short. Sequence identity scattered throughout the length of the query sequence also can indicate a
similarity between the query and profile sequences. The boundaries of the region where the
sequences align can be determined according to Doolittle, supra; BLAST 2.0 (see, e.g., Altschul, et
al. Nucleic Acids Res. (1997) 25:3389-3402) or FAST programs; or by determining the area where
sequence identity is highest.

High Similarity. In general, in alignment results considered to be of high similarity, the
percent of the alignment region length is typically at least about 55% of the total length of the query
sequence; more typically, at least about 58%; even more typically, at least about 60% of the total
residue length of the query sequence. Usually, percent length of the alignment region can be as much
as about 62%; more usually, as much as about 64%; even more usually, as much as about 66%.
Further, for high similarity, the region of alignment, typically, exhibits at least about 75% of
sequence identity; more typically, at least about 78%; even more typically, at least about 80%
sequence identity. Usually, percent sequence identity can be as much as about 82%; more usually,
as much as about 84%; even more usually, as much as about 86%.

The p value is used in conjunction with these methods. If high similarity is found, the query
sequence is considered to have high similarity with a profile sequence when the p value is less than
or equal to about 10^-2; more usually, less than or equal to about 10^-3; even more usually, less than
or equal to about 10^-4. More typically, the p value is no more than about 10^-5; more typically, no
more than or equal to about 10^-10; even more typically, no more than or equal to about 10^-15 for
the query sequence to be considered high similarity.
Weak Similarity. In general, where alignment results are considered to be of weak similarity,
there is no minimum percent length of the alignment region nor minimum length of alignment. A
better showing of weak similarity is considered when the region of alignment is, typically, at least
about 15 amino acid residues in length; more typically, at least about 20; even more typically, at
least about 25 amino acid residues in length. Usually, length of the alignment region can be as much
as about 30 amino acid residues; more usually, as much as about 40; even more usually, as much as
about 60 amino acid residues. Further, for weak similarity, the region of alignment, typically,
exhibits at least about 35% of sequence identity; more typically, at least about 40%; even more
typically, at least about 45% sequence identity. Usually, percent sequence identity can be as much
as about 50%; more usually, as much as about 55%; even more usually, as much as about 60%.

If low similarity is found, the query sequence is considered to have weak similarity with a
profile sequence when the p value is usually less than or equal to about 10^-2; more usually, less than
or equal to about 10^-3; even more usually, less than or equal to about 10^-4. More typically, the p
value is no more than about 10^-5; more usually, no more than or equal to about 10^-10; even more
usually, no more than or equal to about 10^-15 for the query sequence to be considered weak
similarity.
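The three categories can be sketched as a simple classifier in Python. This is an illustrative reading of the lowest "at least about" thresholds stated above (55% alignment region length and 75% identity for high similarity, 35% identity for weak similarity), not a definitive implementation; the function name is hypothetical, and the p value cutoff is left to the caller as an optional argument.

```python
def classify_alignment(pct_region_length, pct_identity, p_value=None, p_cutoff=None):
    """Classify an alignment result as 'high', 'weak', or 'none'.

    pct_region_length and pct_identity are percentages (0 to 100).
    If both p_value and p_cutoff are given, results whose p value exceeds
    the cutoff are rejected before the other thresholds are applied.
    """
    if p_value is not None and p_cutoff is not None and p_value > p_cutoff:
        return "none"
    if pct_region_length >= 55.0 and pct_identity >= 75.0:
        return "high"
    # Weak similarity has no minimum alignment region length in this sketch.
    if pct_identity >= 35.0:
        return "weak"
    return "none"
```

For instance, an alignment covering 55% of the query at 90.9% identity would fall in the high similarity category under these assumed thresholds.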

Similarity Determined by Sequence Identity Alone. Sequence identity alone can be used to
determine similarity of a query sequence to an individual sequence and can indicate the activity of
the sequence. Such an alignment, preferably, permits gaps to align sequences. Typically, the query
sequence is related to the profile sequence if the sequence identity over the entire query sequence is
at least about 15%; more typically, at least about 20%; even more typically, at least about 25%; even
more typically, at least about 50%. Sequence identity alone as a measure of similarity is most useful
when the query sequence is usually at least 80 residues in length; more usually, 90 residues; even
more usually, at least 95 amino acid residues in length. More typically, similarity can be concluded
based on sequence identity alone when the query sequence is preferably 100 residues in length; more
preferably, 120 residues in length; even more preferably, 150 amino acid residues in length.

Alignments with Profile and Multiple Aligned Sequences. Translations of the provided
polynucleotides can be aligned with amino acid profiles that define either protein families or
common motifs. Also, translations of the provided polynucleotides can be aligned to multiple
sequence alignments (MSA) comprising the polypeptide sequences of members of protein families
or motifs. Similarity or identity with profile sequences or MSAs can be used to determine the
activity of the gene products (e.g., polypeptides) encoded by the provided polynucleotides or
corresponding cDNA or genes. For example, sequences that show an identity or similarity with a
chemokine profile or MSA can exhibit chemokine activities.

Profiles can be designed manually by (1) creating an MSA, which is an alignment of the amino
acid sequence of members that belong to the family and (2) constructing a statistical representation
of the alignment. Such methods are described, for example, in Birney et al., Nucl. Acid Res. (1996)
24(14): 2730-2739. MSAs of some protein families and motifs are publicly available. For example,
http://genome.wustl.edu/Pfam/ includes MSAs of 547 different families and motifs. These MSAs
are described also in Sonnhammer et al., Proteins (1997) 28: 405-420. Other sources over the world
wide web include the site at http://www.embl-heidelberg.de/argos/ali/ali.html; alternatively, a
message can be sent to ALI@EMBL-HEIDELBERG.DE for the information. A brief description of
these MSAs is reported in Pascarella et al., Prot. Eng. (1996) 9:249-251. Techniques for building
profiles from MSAs are described in Sonnhammer et al., supra; Birney et al., supra; and "Computer
Methods for Macromolecular Sequence Analysis," Methods in Enzymology (1996) 266, Doolittle,
Academic Press, Inc., San Diego, California, USA.

Similarity between a query sequence and a protein family or motif can be determined by (a)
comparing the query sequence against the profile and/or (b) aligning the query sequence with the
members of the family or motif. Typically, a program such as Searchwise is used to compare the
query sequence to the statistical representation of the multiple alignment, also known as a profile
(see Birney et al., supra). Other techniques to compare the sequence and profile are described in
Sonnhammer et al., supra and Doolittle, supra.

Next, methods described by Feng et al., J. Mol. Evol. (1987) 25:351 and Higgins et al.,
CABIOS (1989) 5:151 can be used to align the query sequence with the members of a family or motif,
also known as a MSA. Sequence alignments can be generated using any of a variety of software
tools. Examples include PileUp, which creates a multiple sequence alignment, and is described in
Feng et al., J. Mol. Evol. (1987) 25:351. Another method, GAP, uses the alignment method of
Needleman et al., J. Mol. Biol. (1970) 48:443. GAP is best suited for global alignment of sequences.
A third method, BestFit, functions by inserting gaps to maximize the number of matches using the
local homology algorithm of Smith et al., Adv. Appl. Math. (1981) 2:482. In general, the following
factors are used to determine if a similarity between a query sequence and a profile or MSA exists:
(1) number of conserved residues found in the query sequence, (2) percentage of conserved residues
found in the query sequence, (3) number of frameshifts, and (4) spacing between conserved residues.

Some alignment programs that both translate and align sequences can make any number of
frameshifts when translating the nucleotide sequence to produce the best alignment. The fewer
frameshifts needed to produce an alignment, the stronger the similarity or identity between the query
and profile or MSAs. For example, a weak similarity resulting from no frameshifts can be a better
indication of activity or structure of a query sequence than a strong similarity resulting from two
frameshifts. Preferably, three or fewer frameshifts are found in an alignment; more preferably two
or fewer frameshifts; even more preferably, one or fewer frameshifts; even more preferably, no
frameshifts are found in an alignment of query and profile or MSAs.

Conserved residues are those amino acids found at a particular position in all or some of the
family or motif members. Alternatively, a position is considered conserved if only a certain class of
amino acids is found in a particular position in all or some of the family members. For example, the
N-terminal position can contain a positively charged amino acid, such as lysine, arginine, or
histidine.

Typically, a residue of a polypeptide is conserved when a class of amino acids or a single
amino acid is found at a particular position in at least about 40% of all class members; more
typically, at least about 50%; even more typically, at least about 60% of the members. Usually, a
residue is conserved when a class or single amino acid is found in at least about 70% of the members
of a family or motif; more usually, at least about 80%; even more usually, at least about 90%; even
more usually, at least about 95%.

A residue is considered conserved when three unrelated amino acids are found at a particular
position in some or all of the members; more usually, two unrelated amino acids. These residues
are conserved when the unrelated amino acids are found at particular positions in at least about 40%
of all class members; more typically, at least about 50%; even more typically, at least about 60% of
the members. Usually, a residue is conserved when a class or single amino acid is found in at least
about 70% of the members of a family or motif; more usually, at least about 80%; even more
usually, at least about 90%; even more usually, at least about 95%.

A query sequence has similarity to a profile or MSA when the query sequence comprises at
least about 25% of the conserved residues of the profile or MSA; more usually, at least about 30%;
even more usually, at least about 40%. Typically, the query sequence has a stronger similarity to a
profile sequence or MSA when the query sequence comprises at least about 45% of the conserved
residues of the profile or MSA; more typically, at least about 50%; even more typically, at least
about 55%.
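The conserved-residue criteria above can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not in the description: all MSA rows have equal length, the query has already been aligned to the same columns, gaps are written as "-", and only single-residue conservation (not conservation of an amino acid class) is considered. The function names are hypothetical.

```python
from collections import Counter


def conserved_positions(msa, min_fraction=0.4):
    """Map column index -> residue for columns where one residue is found
    in at least min_fraction of the family members."""
    conserved = {}
    for col in range(len(msa[0])):
        residues = [seq[col] for seq in msa if seq[col] != "-"]
        if not residues:
            continue
        residue, count = Counter(residues).most_common(1)[0]
        if count / len(msa) >= min_fraction:
            conserved[col] = residue
    return conserved


def fraction_conserved_in_query(aligned_query, conserved):
    """Fraction of the conserved positions at which the query carries the
    conserved residue."""
    if not conserved:
        return 0.0
    hits = sum(1 for col, res in conserved.items() if aligned_query[col] == res)
    return hits / len(conserved)
```

A query would then be regarded as similar to the MSA when the returned fraction is at least about 0.25, following the lowest threshold stated above.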

Identification of Secreted & Membrane-Bound Polypeptides 

Both secreted and membrane-bound polypeptides of the present invention are of particular
interest. For example, levels of secreted polypeptides can be assayed in body fluids that are
convenient, such as blood, plasma, serum, and other body fluids such as urine, prostatic fluid and
semen. Membrane-bound polypeptides are useful for constructing vaccine antigens or inducing an
immune response. Such antigens would comprise all or part of the extracellular region of the
membrane-bound polypeptides. Because both secreted and membrane-bound polypeptides comprise
a fragment of contiguous hydrophobic amino acids, hydrophobicity predicting algorithms can be
used to identify such polypeptides.

A signal sequence is usually encoded by both secreted and membrane-bound polypeptide
genes to direct a polypeptide to the surface of the cell. The signal sequence usually comprises a
stretch of hydrophobic residues. Such signal sequences can fold into helical structures. Membrane-
bound polypeptides typically comprise at least one transmembrane region that possesses a stretch of
hydrophobic amino acids that can transverse the membrane. Some transmembrane regions also
exhibit a helical structure. Hydrophobic fragments within a polypeptide can be identified by using
computer algorithms. Such algorithms include Hopp & Woods, Proc. Natl. Acad. Sci. USA (1981)
78:3824-3828; Kyte & Doolittle, J. Mol. Biol. (1982) 157:105-132; and RAOAR algorithm, Degli
Esposti et al., Eur. J. Biochem. (1990) 190:207-219.
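A hydropathy scan of the Kyte & Doolittle type can be sketched in Python as a sliding-window average over a per-residue scale. The scale values below are the published Kyte-Doolittle hydropathy indices; the window of 19 residues and the cutoff of 1.6 are commonly used conventions for locating candidate membrane-spanning segments and are assumptions here, not values stated in this description. The function names are hypothetical.

```python
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Average hydropathy of every window of the given length.

    Raises KeyError for characters outside the 20 standard amino acids.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def hydrophobic_windows(protein, window=19, cutoff=1.6):
    """Start indices (0-based) of windows whose average meets the cutoff."""
    profile = hydropathy_profile(protein, window)
    return [i for i, score in enumerate(profile) if score >= cutoff]
```

A sequence shorter than the window simply yields an empty profile.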

Another method of identifying secreted and membrane-bound polypeptides is to translate
the polynucleotides of the invention in all six frames and determine if at least 8 contiguous
hydrophobic amino acids are present. Those translated polypeptides with at least 8; more typically,
10; even more typically, 12 contiguous hydrophobic amino acids are considered to be either a
putative secreted or membrane bound polypeptide. Hydrophobic amino acids include alanine,
glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, threonine,
tryptophan, tyrosine, and valine.
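The contiguous-stretch test can be sketched in Python for an amino acid sequence that has already been translated. The residue set is the list given above in one-letter code; the function names are hypothetical.

```python
# One-letter codes for the residues listed above: Ala, Gly, His, Ile, Leu,
# Lys, Met, Phe, Pro, Thr, Trp, Tyr, Val.
HYDROPHOBIC = set("AGHILKMFPTWYV")


def longest_hydrophobic_run(protein):
    """Length of the longest stretch of contiguous hydrophobic residues."""
    best = current = 0
    for aa in protein.upper():
        current = current + 1 if aa in HYDROPHOBIC else 0
        best = max(best, current)
    return best


def is_putative_secreted_or_membrane_bound(protein, min_run=8):
    """True when the sequence has at least min_run contiguous hydrophobic residues."""
    return longest_hydrophobic_run(protein) >= min_run
```

Passing `min_run=10` or `min_run=12` applies the stricter thresholds mentioned above; in practice the check would be run on each of the translated frames.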

Identification of the Function of an Expression Product of a Full-Length Gene 

Ribozymes, antisense constructs, and dominant negative mutants can be used to determine
function of the expression product of a gene corresponding to a polynucleotide provided herein.
These methods and compositions are particularly useful where the provided novel polynucleotide
exhibits no significant or substantial homology to a sequence encoding a gene of known function.
Antisense molecules and ribozymes can be constructed from synthetic polynucleotides. Typically,
the phosphoramidite method of oligonucleotide synthesis is used. See Beaucage et al., Tet. Lett.
(1981) 22:1859 and USPN 4,668,777. Automated devices for synthesis are available to create
oligonucleotides using this chemistry. Examples of such devices include Biosearch 8600, Models
392 and 394 by Applied Biosystems, a division of Perkin-Elmer Corp., Foster City, California,
USA; and Expedite by Perceptive Biosystems, Framingham, Massachusetts, USA. Synthetic RNA,
phosphate analog oligonucleotides, and chemically derivatized oligonucleotides can also be
produced, and can be covalently attached to other molecules. RNA oligonucleotides can be
synthesized, for example, using RNA phosphoramidites. This method can be performed on an
automated synthesizer, such as Applied Biosystems, Models 392 and 394, Foster City, California,
USA.

Phosphorothioate oligonucleotides can also be synthesized for antisense construction. A
sulfurizing reagent, such as tetraethylthiuram disulfide (TETD) in acetonitrile can be used to convert
the internucleotide cyanoethyl phosphite to the phosphorothioate triester within 15 minutes at room
temperature. TETD replaces the iodine reagent, while all other reagents used for standard
phosphoramidite chemistry remain the same. Such a synthesis method can be automated using
Models 392 and 394 by Applied Biosystems, for example.

Oligonucleotides of up to 200 nt can be synthesized, more typically, 100 nt, more typically
50 nt; even more typically 30 to 40 nt. These synthetic fragments can be annealed and ligated
together to construct larger fragments. See, for example, Sambrook et al., supra. Trans-cleaving
catalytic RNAs (ribozymes) are RNA molecules possessing endoribonuclease activity. Ribozymes
are specifically designed for a particular target, and the target message must contain a specific
nucleotide sequence. They are engineered to cleave any RNA species site-specifically in the
background of cellular RNA. The cleavage event renders the mRNA unstable and prevents protein
expression. Importantly, ribozymes can be used to inhibit expression of a gene of unknown
function for the purpose of determining its function in an in vitro or in vivo context, by detecting
the phenotypic effect. One commonly used ribozyme motif is the hammerhead, for which the
substrate sequence requirements are minimal. Design of the hammerhead ribozyme, as well as
therapeutic uses of ribozymes, are disclosed in Usman et al., Current Opin. Struct. Biol. (1996)
6:527. Methods for production of ribozymes, including hairpin structure ribozyme fragments,
methods of increasing ribozyme specificity, and the like are known in the art.

The hybridizing region of the ribozyme can be modified or can be prepared as a branched
structure as described in Horn and Urdea, Nucleic Acids Res. (1989) 17:6959. The basic structure of
the ribozymes can also be chemically altered in ways familiar to those skilled in the art, and
chemically synthesized ribozymes can be administered as synthetic oligonucleotide derivatives
modified by monomeric units. In a therapeutic context, liposome mediated delivery of ribozymes
improves cellular uptake, as described in Birikh et al., Eur. J. Biochem. (1997) 245:1.

Antisense nucleic acids are designed to specifically bind to RNA, resulting in the formation
of RNA-DNA or RNA-RNA hybrids, with an arrest of DNA replication, reverse transcription or
messenger RNA translation. Antisense polynucleotides based on a selected polynucleotide sequence
can interfere with expression of the corresponding gene. Antisense polynucleotides are typically
generated within the cell by expression from antisense constructs that contain the antisense strand as
the transcribed strand. Antisense polynucleotides based on the disclosed polynucleotides will bind
and/or interfere with the translation of mRNA comprising a sequence complementary to the
antisense polynucleotide. The expression products of control cells and cells treated with the
antisense construct are compared to detect the protein product of the gene corresponding to the
polynucleotide upon which the antisense construct is based. The protein is isolated and identified
using routine biochemical methods.
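The complementarity that antisense binding relies on can be illustrated with a short Python sketch that derives a DNA antisense oligonucleotide for a chosen stretch of an mRNA. The function name, the 0-based `start` coordinate, and the choice of a DNA (rather than RNA) oligonucleotide are assumptions made for illustration; the sketch does not address target site selection or oligonucleotide chemistry.

```python
# Maps each RNA base of the target to its complementary DNA base.
DNA_COMPLEMENT_OF_RNA = str.maketrans("ACGU", "TGCA")


def antisense_dna(mrna, start, length):
    """Return the DNA antisense oligonucleotide (written 5' to 3') that is
    complementary to mrna[start:start + length].

    The mRNA may be given with either U or T; T is treated as U.
    """
    target = mrna.upper().replace("T", "U")[start:start + length]
    return target.translate(DNA_COMPLEMENT_OF_RNA)[::-1]
```

For a target beginning AUGGCC, the first six bases give the antisense oligonucleotide GGCCAT.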

Given the extensive background literature and clinical experience in antisense therapy, one
skilled in the art can use selected polynucleotides of the invention as additional potential
therapeutics. The choice of polynucleotide can be narrowed by first testing them for binding to "hot
spot" regions of the genome of cancerous cells. If a polynucleotide is identified as binding to a "hot
spot", testing the polynucleotide as an antisense compound in the corresponding cancer cells is
warranted.

As an alternative method for identifying function of the gene corresponding to a
polynucleotide disclosed herein, dominant negative mutations are readily generated for
corresponding proteins that are active as homomultimers. A mutant polypeptide will interact with
wild-type polypeptides (made from the other allele) and form a non-functional multimer. Thus, a
mutation is in a substrate-binding domain, a catalytic domain, or a cellular localization domain.
Preferably, the mutant polypeptide will be overproduced. Point mutations are made that have such
an effect. In addition, fusion of different polypeptides of various lengths to the terminus of a protein
can yield dominant negative mutants. General strategies are available for making dominant negative
mutants (see, e.g., Herskowitz, Nature (1987) 329:219). Such techniques can be used to create loss
of function mutations, which are useful for determining protein function.

Polypeptides and Variants Thereof

The polypeptides of the invention include those encoded by the disclosed polynucleotides,
as well as nucleic acids that, by virtue of the degeneracy of the genetic code, are not identical in
sequence to the disclosed polynucleotides. Thus, the invention includes within its scope a
polypeptide encoded by a polynucleotide having the sequence of any one of SEQ ID NOS: 1-2707 or
a variant thereof.

In general, the term "polypeptide" as used herein refers to both the full length polypeptide
encoded by the recited polynucleotide, the polypeptide encoded by the gene represented by the
recited polynucleotide, as well as portions or fragments thereof. "Polypeptides" also includes
variants of the naturally occurring proteins, where such variants are homologous or substantially
similar to the naturally occurring protein, and can be of an origin of the same or different species as
the naturally occurring protein (e.g., human, murine, or some other species that naturally expresses
the recited polypeptide, usually a mammalian species). In general, variant polypeptides have a
sequence that has at least about 80%, usually at least about 90%, and more usually at least about
98% sequence identity with a differentially expressed polypeptide of the invention, as measured by
BLAST 2.0 using the parameters described above. The variant polypeptides can be naturally or non-
naturally glycosylated, i.e., the polypeptide has a glycosylation pattern that differs from the
glycosylation pattern found in the corresponding naturally occurring protein.

The invention also encompasses homologs of the disclosed polypeptides (or fragments
thereof) where the homologs are isolated from other species, i.e. other animal or plant species,
where such homologs, usually mammalian species, e.g. rodents, such as mice, rats; domestic
animals, e.g., horse, cow, dog, cat; and humans. By "homolog" is meant a polypeptide having at
least about 35%, usually at least about 40% and more usually at least about 60% amino acid
sequence identity to a particular differentially expressed protein as identified above, where sequence
identity is determined using the BLAST 2.0 algorithm, with the parameters described supra.

In general, the polypeptides of the subject invention are provided in a non-naturally
occurring environment, e.g. are separated from their naturally occurring environment. In certain
embodiments, the subject protein is present in a composition that is enriched for the protein as
compared to a control. As such, purified polypeptide is provided, where by purified is meant that
the protein is present in a composition that is substantially free of non-differentially expressed
polypeptides, where by substantially free is meant that less than 90%, usually less than 60% and
more usually less than 50% of the composition is made up of non-differentially expressed
polypeptides.

Also within the scope of the invention are variants; variants of polypeptides include
mutants, fragments, and fusions. Mutants can include amino acid substitutions, additions or
deletions. The amino acid substitutions can be conservative amino acid substitutions or substitutions
to eliminate non-essential amino acids, such as to alter a glycosylation site, a phosphorylation site or
an acetylation site, or to minimize misfolding by substitution or deletion of one or more cysteine
residues that are not necessary for function. Conservative amino acid substitutions are those that
preserve the general charge, hydrophobicity/hydrophilicity, and/or steric bulk of the amino acid
substituted. Variants can be designed so as to retain or have enhanced biological activity of a
particular region of the protein (e.g., a functional domain and/or, where the polypeptide is a member
of a protein family, a region associated with a consensus sequence). Selection of amino acid
alterations for production of variants can be based upon the accessibility (interior vs. exterior) of the
amino acid (see, e.g., Go et al., Int. J. Peptide Protein Res. (1980) 15:211), the thermostability of the
variant polypeptide (see, e.g., Querol et al., Prot. Eng. (1996) 9:265), desired glycosylation sites
(see, e.g., Olsen and Thomsen, J. Gen. Microbiol. (1991) 137:579), desired disulfide bridges (see,
e.g., Clarke et al., Biochemistry (1993) 32:4322; and Wakarchuk et al., Protein Eng. (1994) 7:1379),
desired metal binding sites (see, e.g., Toma et al., Biochemistry (1991) 30:97, and Haezerbrouck et
al., Protein Eng. (1993) 6:643), and desired substitutions within proline loops (see, e.g., Masul et
al., Appl. Env. Microbiol. (1994) 60:3519). Cysteine-depleted muteins can be produced as disclosed
in USPN 4,959,314.

Variants also include fragments of the polypeptides disclosed herein, particularly
biologically active fragments and/or fragments corresponding to functional domains. Fragments of
interest will typically be at least about 10 aa to at least about 15 aa in length, usually at least about
50 aa in length, and can be as long as 300 aa in length or longer, but will usually not exceed about
1000 aa in length, where the fragment will have a stretch of amino acids that is identical to a
polypeptide encoded by a polynucleotide having a sequence of any SEQ ID NOS: 1-2707, or a
homolog thereof. The protein variants described herein are encoded by polynucleotides that are
within the scope of the invention. The genetic code can be used to select the appropriate codons to
construct the corresponding variants.

Computer-Related Embodiments

In general, a library of polynucleotides is a collection of sequence information, which
information is provided in either biochemical form (e.g., as a collection of polynucleotide
molecules), or in electronic form (e.g., as a collection of polynucleotide sequences stored in a
computer-readable form, as in a computer system and/or as part of a computer program). The
sequence information of the polynucleotides can be used in a variety of ways, e.g., as a resource for
gene discovery, as a representation of sequences expressed in a selected cell type (e.g., cell type
markers), and/or as markers of a given disease or disease state. In general, a disease marker is a
representation of a gene product that is present in all cells affected by disease either at an increased
or decreased level relative to a normal cell (e.g., a cell of the same or similar type that is not
substantially affected by disease). For example, a polynucleotide sequence in a library can be a
polynucleotide that represents an mRNA, polypeptide, or other gene product encoded by the
polynucleotide, that is either overexpressed or underexpressed in a breast ductal cell affected by
cancer relative to a normal (i.e., substantially disease-free) breast cell.

The nucleotide sequence information of the library can be embodied in any suitable form,
e.g., electronic or biochemical forms. For example, a library of sequence information embodied in
electronic form comprises an accessible computer data file (or, in biochemical form, a collection of
nucleic acid molecules) that contains the representative nucleotide sequences of genes that are
differentially expressed (e.g., overexpressed or underexpressed) as between, for example, i) a
cancerous cell and a normal cell; ii) a cancerous cell and a dysplastic cell; iii) a cancerous cell and a
cell affected by a disease or condition other than cancer; iv) a metastatic cancerous cell and a normal
cell and/or non-metastatic cancerous cell; v) a malignant cancerous cell and a non-malignant
cancerous cell (or a normal cell) and/or vi) a dysplastic cell relative to a normal cell. Other
combinations and comparisons of cells affected by various diseases or stages of disease will be
readily apparent to the ordinarily skilled artisan. Biochemical embodiments of the library include a
collection of nucleic acids that have the sequences of the genes in the library, where the nucleic
acids can correspond to the entire gene in the library or to a fragment thereof, as described in greater
detail below.

The polynucleotide libraries of the subject invention generally comprise sequence 
information of a plurality of polynucleotide sequences, where at least one of the polynucleotides has 
a sequence of any of SEQ ID NOS: 1-2707. By plurality is meant at least 2, usually at least 3, and 
can include up to all of SEQ ID NOS: 1-2707. The length and number of polynucleotides in the 
library will vary with the nature of the library, e.g., if the library is an oligonucleotide array, a cDNA 
array, a computer database of the sequence information, etc. 

Where the library is an electronic library, the nucleic acid sequence information can be 
present in a variety of media. "Media" refers to a manufacture, other than an isolated nucleic acid 
molecule, that contains the sequence information of the present invention. Such a manufacture 
provides the genome sequence or a subset thereof in a form that can be examined by means not 
directly applicable to the sequence as it exists in a nucleic acid. For example, the nucleotide 
sequence of the present invention, e.g., the nucleic acid sequences of any of the polynucleotides of 
SEQ ID NOS: 1-2707, can be recorded on computer readable media, e.g., any medium that can be 
read and accessed directly by a computer. Such media include, but are not limited to: magnetic 
storage media, such as a floppy disc, a hard disc storage medium, and a magnetic tape; optical 
storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of 
these categories such as magnetic/optical storage media. One of skill in the art can readily 
appreciate how any of the presently known computer readable media can be used to create a 
manufacture comprising a recording of the present sequence information. "Recorded" refers to a 
process for storing information on a computer readable medium, using any such methods as known in 
the art. Any convenient data storage structure can be chosen, based on the means used to access the 
stored information. A variety of data processor programs and formats can be used for storage, e.g., 
word processing text file, database format, etc. In addition to the sequence information, electronic 
versions of the libraries of the invention can be provided in conjunction or connection with other 
computer-readable information and/or other types of computer-readable files (e.g., searchable files, 
executable files, etc., including, but not limited to, for example, search program software, etc.). 

By providing the nucleotide sequence in computer readable form, the information can be 
accessed for a variety of purposes. Computer software to access sequence information is publicly 
available. For example, the gapped BLAST (Altschul et al., Nucleic Acids Res. (1997) 25:3389- 
3402) and BLAZE (Brutlag et al., Comp. Chem. (1993) 17:203) search algorithms on a Sybase 
system can be used to identify open reading frames (ORFs) within the genome that contain 
homology to ORFs from other organisms. 
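
By way of illustration only, the first step of such an analysis, locating candidate ORFs in a stored sequence, can be sketched in Python as follows. This is a minimal sketch and not the BLAST or BLAZE algorithm: it performs no homology scoring, scans only the three forward reading frames, and treats an ORF as an ATG followed in frame by a stop codon. The function name `find_orfs` and the parameter `min_codons` are hypothetical and are introduced here solely for the example.

```python
# Hypothetical helper (not part of BLAST or BLAZE): a plain forward-strand ORF scan.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) index pairs of ORFs on the forward strand.

    An ORF is taken to be an ATG followed in frame by a stop codon.
    `end` is the index just past the stop codon. `min_codons` counts the
    codons from the ATG up to, but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the candidate and keep scanning
    return sorted(orfs)


example = "CCATGAAATTTTAGGG"
orfs = find_orfs(example)
```

The candidate ORFs returned by such a scan could then be passed to a homology search program of the kind named above for comparison with ORFs from other organisms.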

As used herein, "a computer-based system" refers to the hardware means, software means, 
and data storage means used to analyze the nucleotide sequence information of the present 
invention. The minimum hardware of the computer-based systems of the present invention 
comprises a central processing unit (CPU), input means, output means, and data storage means. A 
skilled artisan can readily appreciate that any one of the currently available computer-based systems 
is suitable for use in the present invention. The data storage means can comprise any manufacture 
comprising a recording of the present sequence information as described above, or a memory access 
means that can access such a manufacture. 

"Search means" refers to one or more programs implemented on the computer-based system, 
to compare a target sequence or target structural motif, or expression levels of a polynucleotide in a 
sample, with the stored sequence information. Search means can be used to identify fragments or 
regions of the genome that match a particular target sequence or target motif. A variety of 
algorithms are publicly known and commercially available, e.g., MacPattern (EMBL), BLASTN and 
BLASTX (NCBI). A "target sequence" can be any polynucleotide or amino acid sequence of six or 
more contiguous nucleotides or two or more amino acids, preferably from about 10 to 100 amino 
acids or from about 30 to 300 nt. A variety of comparing means can be used to accomplish 
comparison of sequence information from a sample (e.g., to analyze target sequences, target motifs, 
or relative expression levels) with the data storage means. A skilled artisan can readily recognize 
that any one of the publicly available homology search programs can be used as the search means 
for the computer based systems of the present invention to accomplish comparison of target 
sequences and motifs. Computer programs to analyze expression levels in a sample and in controls 
are also known in the art. 
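
As a minimal sketch of a search means, and assuming for simplicity that only exact matches are of interest, the comparison of a target sequence with stored sequence information can be illustrated in Python as follows. The function `search_library`, the constant `MIN_TARGET_NT`, and the example sequences are hypothetical; publicly available homology search programs such as BLASTN score inexact alignments and are considerably more elaborate.

```python
MIN_TARGET_NT = 6  # "six or more contiguous nucleotides", per the definition above


def search_library(target, library):
    """Return {sequence_id: [match positions]} for exact matches of `target`.

    `library` is assumed to map sequence identifiers to nucleotide strings.
    """
    target = target.upper()
    if len(target) < MIN_TARGET_NT:
        raise ValueError("target sequence must be at least 6 nt")
    hits = {}
    for seq_id, seq in library.items():
        seq = seq.upper()
        positions = []
        pos = seq.find(target)
        while pos != -1:
            positions.append(pos)
            pos = seq.find(target, pos + 1)  # step by one to allow overlapping matches
        if positions:
            hits[seq_id] = positions
    return hits


# Made-up sequences, used only to exercise the function.
library = {"seq_A": "GGATCCATGGATCCTT", "seq_B": "TTTTAAAACCCC"}
hits = search_library("GGATCC", library)
```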

A "target structural motif," or "target motif," refers to any rationally selected sequence or 
combination of sequences in which the sequence(s) are chosen based on a three-dimensional 
configuration that is formed upon the folding of the target motif, or on consensus sequences of 
regulatory or active sites. There are a variety of target motifs known in the art. Protein target motifs 
include, but are not limited to, enzyme active sites and signal sequences. Nucleic acid target motifs 
include, but are not limited to, hairpin structures, promoter sequences and other expression elements 
such as binding sites for transcription factors. 

A variety of structural formats for the input and output means can be used to input and 
output the information in the computer-based systems of the present invention. One format for an 
output means ranks the relative expression levels of different polynucleotides. Such presentation 
provides a skilled artisan with a ranking of relative expression levels to determine a gene expression 
profile. 
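
One possible form of such a ranked output is sketched below in Python. The function name `rank_expression`, the polynucleotide identifiers, and the expression values are hypothetical and serve only to show the ordering step.

```python
def rank_expression(levels):
    """Sort (polynucleotide_id, relative_level) pairs from highest to lowest level."""
    return sorted(levels.items(), key=lambda item: item[1], reverse=True)


# Made-up relative expression levels for three polynucleotides.
profile = rank_expression({"poly_1": 0.8, "poly_2": 3.5, "poly_3": 1.2})
for rank, (poly_id, level) in enumerate(profile, start=1):
    print(rank, poly_id, level)
```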

As discussed above, the "library" of the invention also encompasses biochemical libraries of 
the polynucleotides of SEQ ID NOS: 1-2707, e.g., collections of nucleic acids representing the 
provided polynucleotides. The biochemical libraries can take a variety of forms, e.g., a solution of 
cDNAs, a pattern of probe nucleic acids stably associated with a surface of a solid support (i.e., an 
array) and the like. Of particular interest are nucleic acid arrays in which one or more of SEQ ID 
NOS: 1-2707 is represented on the array. By array is meant an article of manufacture that has at 
least a substrate with at least two distinct nucleic acid targets on one of its surfaces, where the 
number of distinct nucleic acids can be considerably higher, typically being at least 10 nt, usually at 
least 20 nt and often at least 25 nt. A variety of different array formats have been developed and are 
known to those of skill in the art. The arrays of the subject invention find use in a variety of 
applications, including gene expression analysis, drug screening, mutation analysis and the like, as 
disclosed in the above-listed exemplary patent documents. 

In addition to the above nucleic acid libraries, analogous libraries of polypeptides are also 
provided, where the polypeptides of the library will represent at least a portion of the 
polypeptides encoded by SEQ ID NOS: 1-2707. 

Utilities 

Use of Polynucleotide Probes in Mapping, and in Tissue Profiling 
Polynucleotide probes, generally comprising at least 12 contiguous nt of a polynucleotide as 
shown in the Sequence Listing, are used for a variety of purposes, such as chromosome mapping of 
the polynucleotide and detection of transcription levels. Additional disclosure about preferred 
regions of the disclosed polynucleotide sequences is found in the Examples. A probe that hybridizes 
specifically to a polynucleotide disclosed herein should provide a detection signal at least 5-, 10-, or 
20-fold higher than the background hybridization provided with other unrelated sequences. 
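
The specificity criterion stated above amounts to a ratio test, which can be sketched in Python as follows. The function names and the signal values are hypothetical; the units of the signal (e.g., densitometry counts) are an assumption and need only be the same for probe and background.

```python
def fold_over_background(signal, background):
    """Ratio of the probe detection signal to the background hybridization signal."""
    if background <= 0:
        raise ValueError("background must be positive")
    return signal / background


def is_specific(signal, background, min_fold=5.0):
    """True if the signal is at least `min_fold` times background (5, 10, or 20 above)."""
    return fold_over_background(signal, background) >= min_fold


# Made-up readings: 1200 units for the probe against 100 units of background.
ratio = fold_over_background(1200, 100)
```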

Detection of Expression Levels. Nucleotide probes are used to detect expression of a gene 
corresponding to the provided polynucleotide. In Northern blots, mRNA is separated 
electrophoretically and contacted with a probe. A probe is detected as hybridizing to an mRNA 
species of a particular size. The amount of hybridization is quantitated to determine relative 
amounts of expression, for example under a particular condition. Probes are used for in situ 
hybridization to cells to detect expression. Probes can also be used in vivo for diagnostic detection 
of hybridizing sequences. Probes are typically labeled with a radioactive isotope. Other types of 
detectable labels can be used such as chromophores, fluors, and enzymes. Other examples of 
nucleotide hybridization assays are described in WO92/02526 and USPN 5,124,246. 






WO 99/58675 



PCT/US99/10602 



Alternatively, the Polymerase Chain Reaction (PCR) is another means for detecting small 
amounts of target nucleic acids (see, e.g., Mullis et al., Meth. Enzymol. (1987) 155:335; USPN 
4,683,195; and USPN 4,683,202). Two primer polynucleotides that hybridize with the 
target nucleic acids are used to prime the reaction. The primers can be composed of sequence within 
or 3' and 5' to the polynucleotides of the Sequence Listing. Alternatively, if the primers are 3' and 5' 
to these polynucleotides, they need not hybridize to them or the complements. After amplification 
of the target with a thermostable polymerase, the amplified target nucleic acids can be detected by 
methods known in the art, e.g., Southern blot. mRNA or cDNA can also be detected by traditional 
blotting techniques (e.g., Southern blot, Northern blot, etc.) described in Sambrook et al., 
"Molecular Cloning: A Laboratory Manual" (New York, Cold Spring Harbor Laboratory, 1989) 
(e.g., without PCR amplification). In general, mRNA or cDNA generated from mRNA using a 
polymerase enzyme can be purified and separated using gel electrophoresis, and transferred to a 
solid support, such as nitrocellulose. The solid support is exposed to a labeled probe, washed to 
remove any unhybridized probe, and duplexes containing the labeled probe are detected. 

Mapping. Polynucleotides of the present invention can be used to identify a chromosome on 
which the corresponding gene resides. Such mapping can be useful in identifying the function of the 
polynucleotide-related gene by its proximity to other genes with known function. Function can also 
be assigned to the polynucleotide-related gene when particular syndromes or diseases map to the 
same chromosome. For example, use of polynucleotide probes in identification and quantification 
of nucleic acid sequence aberrations is described in USPN 5,783,387. An exemplary mapping 
method is fluorescence in situ hybridization (FISH), which facilitates comparative genomic 
hybridization to allow total genome assessment of changes in relative copy number of DNA 
sequences (see, e.g., Valdes et al., Methods in Molecular Biology (1997) 68:1). Polynucleotides 
can also be mapped to particular chromosomes using, for example, radiation hybrids or 
chromosome-specific hybrid panels. See Leach et al., Advances in Genetics (1995) 33:63-99; 
Walter et al., Nature Genetics (1994) 7:22; Walter and Goodfellow, Trends in Genetics (1992) 
9:352. Panels for radiation hybrid mapping are available from Research Genetics, Inc., Huntsville, 
Alabama, USA. Databases for markers using various panels are available via the world wide web at 
http://shgc-www.stanford.edu; and http://www-genome.wi.mit.edu/cgi-bin/contig/rhmapper.pl. The 
statistical program RHMAP can be used to construct a map based on the data from radiation 
hybridization with a measure of the relative likelihood of one order versus another. RHMAP is 
available via the world wide web at http://www.sph.umich.edu/group/statgen/software. In addition, 
commercial programs are available for identifying regions of chromosomes commonly associated 
with disease, such as cancer. 

Tissue Typing or Profiling. Expression of specific mRNA corresponding to the provided 
polynucleotides can vary in different cell types and can be tissue-specific. This variation of mRNA 
levels in different cell types can be exploited with nucleic acid probe assays to determine tissue 
types. For example, PCR, branched DNA probe assays, or blotting techniques utilizing nucleic acid 
probes substantially identical or complementary to polynucleotides listed in the Sequence Listing 
can determine the presence or absence of the corresponding cDNA or mRNA. 

Tissue typing can be used to identify the developmental organ or tissue source of a 
metastatic lesion by identifying the expression of a particular marker of that organ or tissue. If a 
polynucleotide is expressed only in a specific tissue type, and a metastatic lesion is found to express 
that polynucleotide, then the developmental source of the lesion has been identified. Expression of a 
particular polynucleotide can be assayed by detection of either the corresponding mRNA or the 
protein product. As would be readily apparent to any forensic scientist, the sequences disclosed 
herein are useful in differentiating human tissue from non-human tissue. In particular, these 
sequences are useful to differentiate human tissue from bird, reptile, and amphibian tissue, for 
example. 

Use of Polymorphisms. A polynucleotide of the invention can be used in forensics, genetic 
analysis, mapping, and diagnostic applications where the corresponding region of a gene is 
polymorphic in the human population. Any means for detecting a polymorphism in a gene can be 
used, including, but not limited to, electrophoresis of protein polymorphic variants, differential 
sensitivity to restriction enzyme cleavage, and hybridization to allele-specific probes. 

Antibody Production 

Expression products of a polynucleotide of the invention, as well as the corresponding 
mRNA, cDNA, or complete gene, can be prepared and used for raising antibodies for experimental, 
diagnostic, and therapeutic purposes. For polynucleotides to which a corresponding gene has not 
been assigned, this provides an additional method of identifying the corresponding gene. The 
polynucleotide or related cDNA is expressed as described above, and antibodies are prepared. These 
antibodies are specific to an epitope on the polypeptide encoded by the polynucleotide, and can 
precipitate or bind to the corresponding native protein in a cell or tissue preparation or in a cell-free 
extract of an in vitro expression system. 

Methods for production of antibodies that specifically bind a selected antigen are well 
known in the art. Immunogens for raising antibodies can be prepared by mixing a polypeptide 
encoded by a polynucleotide of the invention with an adjuvant, and/or by making fusion proteins 
with larger immunogenic proteins. Polypeptides can also be covalently linked to other larger 
immunogenic proteins, such as keyhole limpet hemocyanin. Immunogens are typically administered 
intradermally, subcutaneously, or intramuscularly to experimental animals such as rabbits, sheep, 
and mice, to generate antibodies. Monoclonal antibodies can be generated by isolating spleen cells 
and fusing them with myeloma cells to form hybridomas. Alternatively, the 
selected polynucleotide is administered directly, such as by intramuscular injection, and expressed in 
vivo. The expressed protein generates a variety of protein-specific immune responses, including 
production of antibodies, comparable to administration of the protein. 

Preparations of polyclonal and monoclonal antibodies specific for polypeptides encoded by 
a selected polynucleotide are made using standard methods known in the art. The antibodies 
specifically bind to epitopes present in the polypeptides encoded by polynucleotides disclosed in the 
Sequence Listing. Typically, at least 6, 8, 10, or 12 contiguous amino acids are required to form an 
epitope. Epitopes that involve non-contiguous amino acids may require a longer polypeptide, e.g., at 
least 15, 25, or 50 amino acids. Antibodies that specifically bind to human polypeptides encoded by 
the provided polynucleotides should provide a detection signal at least 5-, 10-, or 20-fold higher than a 
detection signal provided with other proteins when used in Western blots or other immunochemical 
assays. Preferably, antibodies that specifically bind polypeptides of the invention do not bind to other 
proteins in immunochemical assays at detectable levels and can immunoprecipitate the specific 
polypeptide from solution. 

The invention also contemplates naturally occurring antibodies specific for a polypeptide of 
the invention. For example, serum antibodies to a polypeptide of the invention in a human 
population can be purified by methods well known in the art, e.g., by passing antiserum over a 
column to which the corresponding selected polypeptide or fusion protein is bound. The bound 
antibodies can then be eluted from the column, for example using a buffer with a high salt 
concentration. 

In addition to the antibodies discussed above, the invention also contemplates genetically 
engineered antibodies and antibody derivatives (e.g., single chain antibodies, antibody fragments (e.g., 
Fab, etc.)), according to methods well known in the art. 

Polynucleotides or Arrays for Diagnostics 

Polynucleotide arrays provide a high throughput technique that can assay a large number of 
polynucleotide sequences in a sample. This technology can be used as a diagnostic and as a tool to 
test for differential expression, e.g., to determine function of an encoded protein. Arrays can be 
created by spotting polynucleotide probes onto a substrate (e.g., glass, nitrocellulose, etc.) in a 
two-dimensional matrix or array having bound probes. The probes can be bound to the substrate by either 
covalent bonds or by non-specific interactions, such as hydrophobic interactions. Samples of 
polynucleotides can be detectably labeled (e.g., using radioactive or fluorescent labels) and then 
hybridized to the probes. Double stranded polynucleotides, comprising the labeled sample 
polynucleotides bound to probe polynucleotides, can be detected once the unbound portion of the 
sample is washed away. Techniques for constructing arrays and methods of using these arrays are 
described in EP 799 897; WO 97/29212; WO 97/27317; EP 785 280; WO 97/02357; USPN 
5,593,839; USPN 5,578,832; EP 728 520; USPN 5,599,695; EP 721 016; USPN 5,556,752; WO 
95/22058; and USPN 5,631,734. Arrays can be used to, for example, examine differential 
expression of genes and can be used to determine gene function. For example, arrays can be used to 
detect differential expression of a polynucleotide between a test cell and control cell (e.g., cancer 
cells and normal cells). For example, high expression of a particular message in a cancer cell, which 
is not observed in a corresponding normal cell, can indicate a cancer specific gene product. 
Exemplary uses of arrays are further described in, for example, Pappalarado et al., Sem. Radiation 
Oncol. (1998) 8:217; and Ramsay, Nature Biotechnol. (1998) 16:40. 

Differential Expression in Diagnosis 

The polynucleotides of the invention can also be used to detect differences in expression 
levels between two cells, e.g., as a method to identify abnormal or diseased tissue in a human. For 
polynucleotides corresponding to profiles of protein families, the choice of tissue can be selected 
according to the putative biological function. In general, the expression of a gene corresponding to a 
specific polynucleotide is compared between a first tissue that is suspected of being diseased and a 
second, normal tissue of the human. The tissue suspected of being abnormal or diseased can be 
derived from a different tissue type of the human, but preferably it is derived from the same tissue 
type; for example an intestinal polyp or other abnormal growth should be compared with normal 
intestinal tissue. The normal tissue can be the same tissue as that of the test sample, or any normal 
tissue of the patient, especially those that express the polynucleotide-related gene of interest (e.g., 
brain, thymus, testis, heart, prostate, placenta, spleen, small intestine, skeletal muscle, pancreas, and 
the mucosal lining of the colon). A difference between the polynucleotide-related gene, mRNA, or 
protein in the two tissues which are compared, for example in molecular weight, amino acid or 
nucleotide sequence, or relative abundance, indicates a change in the gene, or a gene which regulates 
it, in the tissue of the human that was suspected of being diseased. Examples of detection of 
differential expression and its use in diagnosis of cancer are described in USPNs 5,688,641 and 
5,677,125. 

A genetic predisposition to disease in a human can also be detected by comparing 
expression levels of an mRNA or protein corresponding to a polynucleotide of the invention in a 
fetal tissue with levels associated in normal fetal tissue. Fetal tissues that are used for this purpose 
include, but are not limited to, amniotic fluid, chorionic villi, blood, and the blastomere of an in 
vitro-fertilized embryo. The comparable normal polynucleotide-related gene is obtained from any 
tissue. The mRNA or protein is obtained from a normal tissue of a human in which the 
polynucleotide-related gene is expressed. Differences such as alterations in the nucleotide sequence 
or size of the same product of the fetal polynucleotide-related gene or mRNA, or alterations in the 
molecular weight, amino acid sequence, or relative abundance of fetal protein, can indicate a 
germline mutation in the polynucleotide-related gene of the fetus, which indicates a genetic 
predisposition to disease. In general, diagnostic, prognostic, and other methods of the invention 
based on differential expression involve detection of a level or amount of a gene product, 
particularly a differentially expressed gene product, in a test sample obtained from a patient 
suspected of having or being susceptible to a disease (e.g., breast cancer, lung cancer, colon cancer 
and/or metastatic forms thereof), and comparing the detected levels to those levels found in normal 
cells (e.g., cells substantially unaffected by cancer) and/or other control cells (e.g., to differentiate a 
cancerous cell from a cell affected by dysplasia). Furthermore, the severity of the disease can be 
assessed by comparing the detected levels of a differentially expressed gene product with those 
levels detected in samples representing the levels of differentially expressed gene product associated with 
varying degrees of severity of disease. It should be noted that use of the term "diagnostic" herein is 
not necessarily meant to exclude "prognostic" or "prognosis," but rather is used as a matter of 
convenience. 

The term "differentially expressed gene" is generally intended to encompass a 
polynucleotide that can, for example, include an open reading frame encoding a gene product (e.g., a 
polypeptide), and/or introns of such genes and adjacent 5' and 3' non-coding nucleotide sequences 
involved in the regulation of expression, up to about 20 kb beyond the coding region, but possibly 
further in either direction. The gene can be introduced into an appropriate vector for 
extrachromosomal maintenance or for integration into a host genome. In general, a difference in 
expression level associated with a decrease in expression level of at least about 25%, usually at least 
about 50% to 75%, more usually at least about 90% or more is indicative of a differentially 
expressed gene of interest, i.e., a gene that is underexpressed or down-regulated in the test sample 
relative to a control sample. Furthermore, a difference in expression level associated with an 
increase in expression of at least about 25%, usually at least about 50% to 75%, more usually at least 
about 90% and can be at least about 1½-fold, usually at least about 2-fold to about 10-fold, and can 
be about 100-fold to about 1,000-fold increase relative to a control sample is indicative of a 
differentially expressed gene of interest, i.e., an overexpressed or up-regulated gene. 
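
The lowest thresholds recited above (a decrease or an increase of at least about 25% relative to a control) can be expressed as a simple classification rule. The following Python sketch is illustrative only; the function name `classify_expression`, the default `min_change` of 0.25, and the example values are assumptions chosen to mirror the 25% figure, and a stricter threshold (e.g., 0.5 or 0.9) could be substituted.

```python
def classify_expression(test_level, control_level, min_change=0.25):
    """Classify a gene by its fractional change in the test sample versus the control.

    Returns "overexpressed", "underexpressed", or "unchanged".
    """
    if control_level <= 0:
        raise ValueError("control_level must be positive")
    change = (test_level - control_level) / control_level
    if change >= min_change:
        return "overexpressed"    # up-regulated relative to the control
    if change <= -min_change:
        return "underexpressed"   # down-regulated relative to the control
    return "unchanged"


# Made-up expression levels in arbitrary units.
results = [
    classify_expression(200, 100),
    classify_expression(50, 100),
    classify_expression(110, 100),
]
```
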

"Differentially expressed polynucleotide" as used herein means a nucleic acid molecule 
(RNA or DNA) comprising a sequence that represents a differentially expressed gene, e.g., the 
differentially expressed polynucleotide comprises a sequence (e.g., an open reading frame encoding 
a gene product) that uniquely identifies a differentially expressed gene so that detection of the 
differentially expressed polynucleotide in a sample is correlated with the presence of a differentially 
expressed gene in a sample. "Differentially expressed polynucleotides" is also meant to encompass 
fragments of the disclosed polynucleotides, e.g., fragments retaining biological activity, as well as 
nucleic acids homologous, substantially similar, or substantially identical (e.g., having about 90% 
sequence identity) to the disclosed polynucleotides. 

"Diagnosis" as used herein generally includes determination of a subject's susceptibility to a 
disease or disorder, determination as to whether a subject is presently affected by a disease or 
disorder, as well as the prognosis of a subject affected by a disease or disorder (e.g., identification 
of pre-metastatic or metastatic cancerous states, stages of cancer, or responsiveness of cancer to 
therapy). The present invention particularly encompasses diagnosis of subjects in the context of 
breast cancer (e.g., carcinoma in situ (e.g., ductal carcinoma in situ), estrogen receptor (ER)-positive 
breast cancer, ER-negative breast cancer, or other forms and/or stages of breast cancer), lung cancer 
(e.g., small cell carcinoma, non-small cell carcinoma, mesothelioma, and other forms and/or stages 
of lung cancer), and colon cancer (e.g., adenomatous polyp, colorectal carcinoma, and other forms 
and/or stages of colon cancer). 

"Sample" or "biological sample" as used throughout here are generally meant to refer to 
samples of biological fluids or tissues, particularly samples obtained from tissues, especially from 
cells of the type associated with the disease for which the diagnostic application is designed (e.g., 
ductal adenocarcinoma), and the like. "Samples" is also meant to encompass derivatives and 
fractions of such samples (e.g., cell lysates). Where the sample is solid tissue, the cells of the tissue 
can be dissociated or tissue sections can be analyzed. 

Methods of the subject invention useful in diagnosis or prognosis typically involve 
comparison of the abundance of a selected differentially expressed gene product in a sample of 
interest with that of a control to determine any relative differences in the expression of the gene 
product, where the difference can be measured qualitatively and/or quantitatively. Quantitation can 
be accomplished, for example, by comparing the level of expression product detected in the sample 
with the amounts of product present in a standard curve. A comparison can be made visually; by 
using a technique such as densitometry, with or without computerized assistance; by preparing a 
representative library of cDNA clones of mRNA isolated from a test sample, sequencing the clones 
in the library to determine the number of cDNA clones corresponding to the same gene product, and 
analyzing the number of clones corresponding to that same gene product relative to the number of 
clones of the same gene product in a control sample; or by using an array to detect relative levels of 
hybridization to a selected sequence or set of sequences, and comparing the hybridization pattern to 
that of a control. The differences in expression are then correlated with the presence or absence of 
an abnormal expression pattern. A variety of different methods for determining the nucleic acid 
abundance in a sample are known to those of skill in the art (see, e.g., WO 97/27317). In general, 
diagnostic assays of the invention involve detection of a gene product of the polynucleotide 
sequence (e.g., mRNA or polypeptide) that corresponds to a sequence of SEQ ID NOS: 1-2707. The 
patient from whom the sample is obtained can be apparently healthy, susceptible to disease (e.g., as 
determined by family history or exposure to certain environmental factors), or can already be 
identified as having a condition in which altered expression of a gene product of the invention is 
implicated. 
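
The clone-counting comparison described above can be sketched in Python as follows. This is a minimal sketch under a stated assumption: clone counts are normalized by the total number of clones sequenced in each library so that libraries of different size can be compared, a normalization that is not specified in the text. The function names and the gene identifiers are hypothetical.

```python
from collections import Counter


def clone_frequencies(clone_gene_ids):
    """Fraction of sequenced clones assigned to each gene product in one library."""
    total = len(clone_gene_ids)
    if total == 0:
        return {}
    counts = Counter(clone_gene_ids)
    return {gene: n / total for gene, n in counts.items()}


def relative_abundance(gene, test_clones, control_clones):
    """Ratio of a gene's clone frequency in the test library to that in the control.

    Returns None when the gene is absent from the control library, since the
    ratio is then undefined.
    """
    test = clone_frequencies(test_clones).get(gene, 0.0)
    control = clone_frequencies(control_clones).get(gene, 0.0)
    if control == 0.0:
        return None
    return test / control


# Made-up clone assignments: one gene identifier per sequenced clone.
test_clones = ["g1"] * 5 + ["g2"] * 5
control_clones = ["g1"] * 2 + ["g2"] * 6
ratio_g1 = relative_abundance("g1", test_clones, control_clones)
```

A ratio above 1 suggests that the gene product is more abundant in the test sample than in the control, and a ratio below 1 suggests the opposite; with small clone numbers such ratios are only rough estimates.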

Diagnosis can be determined based on detected gene product expression levels of a gene 
product encoded by at least one, preferably at least two or more, at least 3 or more, or at least 4 or 
more of the polynucleotides having a sequence set forth in SEQ ID NOS: 1-2707, and can involve 
detection of expression of genes corresponding to all of SEQ ID NOS: 1-2707 and/or additional 
sequences that can serve as additional diagnostic markers and/or reference sequences. Where the 
diagnostic method is designed to detect the presence or susceptibility of a patient to cancer, the 
assay preferably involves detection of a gene product encoded by a gene corresponding to a 
polynucleotide that is differentially expressed in cancer. Examples of such differentially expressed 
polynucleotides are described in the Examples below. Given the provided polynucleotides and 
information regarding their relative expression levels provided herein, assays using such 
polynucleotides and detection of their expression levels in diagnosis and prognosis will be readily 
apparent to the ordinarily skilled artisan. 

Any of a variety of detectable labels can be used in connection with the various 
embodiments of the diagnostic methods of the invention. Suitable detectable labels include 
fluorochromes (e.g., fluorescein isothiocyanate (FITC), rhodamine, Texas Red, phycoerythrin, 
allophycocyanin, 6-carboxyfluorescein (6-FAM), 2',7'-dimethoxy-4',5'-dichloro-6-carboxyfluorescein, 
6-carboxy-X-rhodamine (ROX), 6-carboxy-2',4',7',4,7-hexachlorofluorescein 
(HEX), 5-carboxyfluorescein (5-FAM) or N,N,N',N'-tetramethyl-6-carboxyrhodamine (TAMRA)), 
radioactive labels (e.g., 32P, 35S, 3H, etc.), and the like. The detectable label can involve a two 
stage system (e.g., biotin-avidin, hapten-anti-hapten antibody, etc.). 

Reagents specific for the polynucleotides and polypeptides of the invention, such as 
antibodies and nucleotide probes, can be supplied in a kit for detecting the presence of an expression 
product in a biological sample. The kit can also contain buffers or labeling components, as well as 
instructions for using the reagents to detect and quantify expression products in the biological 
sample. Exemplary embodiments of the diagnostic methods of the invention are described below in 
more detail. 

Polypeptide detection in diagnosis. In one embodiment, the test sample is assayed for the 
level of a differentially expressed polypeptide. Diagnosis can be accomplished using any of a 
number of methods to determine the absence or presence or altered amounts of the differentially
expressed polypeptide in the test sample. For example, detection can utilize staining of cells or
histological sections with labeled antibodies, performed in accordance with conventional methods.
Cells can be permeabilized to stain cytoplasmic molecules. In general, antibodies that specifically
bind a differentially expressed polypeptide of the invention are added to a sample, and incubated for
a period of time sufficient to allow binding to the epitope, usually at least about 10 minutes. The
antibody can be detectably labeled for direct detection (e.g., using radioisotopes, enzymes,
fluorescers, chemiluminescers, and the like), or can be used in conjunction with a second stage
antibody or reagent to detect binding (e.g., biotin with horseradish peroxidase-conjugated avidin, a
secondary antibody conjugated to a fluorescent compound, e.g., fluorescein, rhodamine, Texas red,
etc.). The absence or presence of antibody binding can be determined by various methods, including
flow cytometry of dissociated cells, microscopy, radiography, scintillation counting, etc. Any
suitable alternative methods of qualitative or quantitative detection of levels or amounts of
differentially expressed polypeptide can be used, for example ELISA, western blot,
immunoprecipitation, radioimmunoassay, etc.

mRNA detection. The diagnostic methods of the invention can also or alternatively involve
detection of mRNA encoded by a gene corresponding to a differentially expressed polynucleotide
of the invention. Any suitable qualitative or quantitative methods known in the art for detecting
specific mRNAs can be used. mRNA can be detected by, for example, in situ hybridization in tissue
sections, by reverse transcriptase-PCR, or in Northern blots containing poly A+ mRNA. One of skill
in the art can readily use these methods to determine differences in the size or amount of mRNA
transcripts between two samples. mRNA expression levels in a sample can also be determined by
generation of a library of expressed sequence tags (ESTs) from the sample, where the EST library is
representative of sequences present in the sample (Adams, et al., (1991) Science 252:1651).
Enumeration of the relative representation of ESTs within the library can be used to approximate the
relative representation of the gene transcript within the starting sample. The results of EST analysis
of a test sample can then be compared to EST analysis of a reference sample to determine the
relative expression levels of a selected polynucleotide, particularly a polynucleotide corresponding
to one or more of the differentially expressed genes described herein. Alternatively, gene expression
analysis in a test sample can be performed using serial analysis of gene expression (SAGE)
methodology (e.g., Velculescu et al., Science (1995) 270:484) or differential display (DD)
methodology (see, e.g., U.S. 5,776,683; and U.S. 5,807,680).
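
The EST enumeration described above can be illustrated with a minimal sketch. It assumes that each EST has already been assigned to a gene identifier; the function names and the toy libraries are hypothetical and are not part of the disclosure.

```python
from collections import Counter


def est_relative_representation(est_gene_ids):
    """Return each gene's fraction of all ESTs in one library."""
    counts = Counter(est_gene_ids)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


def est_fold_change(test_ests, reference_ests, gene):
    """Ratio of a gene's EST fraction in the test library to the reference library.

    Returns None when the gene is absent from the reference library,
    because the ratio is undefined in that case.
    """
    test = est_relative_representation(test_ests).get(gene, 0.0)
    ref = est_relative_representation(reference_ests).get(gene, 0.0)
    if ref == 0.0:
        return None
    return test / ref


# Toy libraries (hypothetical): one entry per sequenced EST.
test_library = ["geneA", "geneA", "geneA", "geneB"]
reference_library = ["geneA", "geneB", "geneB", "geneB"]
```

Here geneA accounts for three of four ESTs in the test library and one of four in the reference library, so its relative representation is three-fold higher in the test sample. Real libraries would need far more ESTs for such a ratio to be meaningful.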

Alternatively, gene expression can be analyzed using hybridization analysis.
Oligonucleotides or cDNA can be used to selectively identify or capture DNA or RNA of specific
sequence composition, and the amount of RNA or cDNA hybridized to a known capture sequence
determined qualitatively or quantitatively, to provide information about the relative representation of
a particular message within the pool of cellular messages in a sample. Hybridization analysis can be
designed to allow for concurrent screening of the relative expression of hundreds to thousands of
genes by using, for example, array-based technologies having high density formats, including filters,
microscope slides, or microchips, or solution-based technologies that use spectroscopic analysis
(e.g., mass spectrometry). One exemplary use of arrays in the diagnostic methods of the invention is
described below in more detail.

Use of a single gene in diagnostic applications. The diagnostic methods of the invention can
focus on the expression of a single differentially expressed gene. For example, the diagnostic
method can involve detecting a differentially expressed gene, or a polymorphism of such a gene
(e.g., a polymorphism in a coding region or control region), that is associated with disease.
Disease-associated polymorphisms can include deletion or truncation of the gene, mutations that
alter expression level and/or affect activity of the encoded protein, etc.

A number of methods are available for analyzing nucleic acids for the presence of a specific
sequence, e.g., a disease associated polymorphism. Where large amounts of DNA are available,
genomic DNA is used directly. Alternatively, the region of interest is cloned into a suitable vector
and grown in sufficient quantity for analysis. Cells that express a differentially expressed gene can
be used as a source of mRNA, which can be assayed directly or reverse transcribed into cDNA for
analysis. The nucleic acid can be amplified by conventional techniques, such as the polymerase
chain reaction (PCR), to provide sufficient amounts for analysis, and a detectable label can be
included in the amplification reaction (e.g., using a detectably labeled primer or detectably labeled
oligonucleotides) to facilitate detection. Alternatively, various methods are also known in the art that
utilize oligonucleotide ligation as a means of detecting polymorphisms, see e.g., Riley et al., Nucl.
Acids Res. (1990) 18:2887; and Delahunty et al., Am. J. Hum. Genet. (1996) 58:1239.

The amplified or cloned sample nucleic acid can be analyzed by one of a number of methods
known in the art. The nucleic acid can be sequenced by dideoxy or other methods, and the sequence
of bases compared to a selected sequence, e.g., to a wild-type sequence. Hybridization with the
polymorphic or variant sequence can also be used to determine its presence in a sample (e.g., by
Southern blot, dot blot, etc.). The hybridization pattern of a polymorphic or variant sequence and a
control sequence to an array of oligonucleotide probes immobilized on a solid support, as described
in US 5,445,934, or in WO 95/35505, can also be used as a means of identifying polymorphic or
variant sequences associated with disease. Single strand conformational polymorphism (SSCP)
analysis, denaturing gradient gel electrophoresis (DGGE), and heteroduplex analysis in gel matrices
are used to detect conformational changes created by DNA sequence variation as alterations in
electrophoretic mobility. Alternatively, where a polymorphism creates or destroys a recognition site
for a restriction endonuclease, the sample is digested with that endonuclease, and the products size
fractionated to determine whether the fragment was digested. Fractionation is performed by gel or
capillary electrophoresis, particularly acrylamide or agarose gels.
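
The restriction-site approach in the preceding paragraph can be sketched in silico as follows. This is a simplified illustration that assumes a linear fragment, searches only the strand given, and uses made-up sequences; the EcoRI site GAATTC, cut after the first G, is used only as a familiar example, and all function names are hypothetical.

```python
def has_site(sequence, recognition_site):
    """True if the recognition site occurs in the given strand."""
    return recognition_site.upper() in sequence.upper()


def site_change(reference_seq, sample_seq, recognition_site):
    """Report whether the sample sequence creates or destroys the site."""
    in_ref = has_site(reference_seq, recognition_site)
    in_sample = has_site(sample_seq, recognition_site)
    if in_ref and not in_sample:
        return "destroyed"
    if in_sample and not in_ref:
        return "created"
    return "unchanged"


def digest_fragment_lengths(sequence, recognition_site, cut_offset):
    """Fragment lengths after cutting a linear molecule at every site.

    cut_offset is the number of bases into the site at which the
    enzyme cuts the strand (1 for EcoRI, which cuts G^AATTC).
    """
    seq = sequence.upper()
    site = recognition_site.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


# Hypothetical 14-nt sequences differing by one base inside the site.
wild_type = "AAAAGAATTCTTTT"
variant = "AAAAGAGTTCTTTT"
```

With these toy sequences the wild-type fragment is cut into pieces of 5 and 9 nt, while the variant remains a single 14-nt fragment, which is the size difference that fractionation would reveal.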

Screening for mutations in a gene can be based on the functional or antigenic characteristics 
of the protein. Protein truncation assays are useful in detecting deletions that can affect the 
biological activity of the protein. Various immunoassays designed to detect polymorphisms in
proteins can be used in screening. Where many diverse genetic mutations lead to a particular 
disease phenotype, functional protein assays have proven to be effective screening tools. The 
activity of the encoded protein can be determined by comparison with the wild-type protein. 

Pattern matching in diagnosis using arrays. In another embodiment, the diagnostic and/or
prognostic methods of the invention involve detection of expression of a selected set of genes in a
test sample to produce a test expression pattern (TEP). The TEP is compared to a reference
expression pattern (REP), which is generated by detection of expression of the selected set of genes
in a reference sample (e.g., a positive or negative control sample). The selected set of genes
includes at least one of the genes of the invention, which genes correspond to the polynucleotide
sequences of SEQ ID NOS: 1-2707. Of particular interest is a selected set of genes that includes
genes differentially expressed in the disease for which the test sample is to be screened.

"Reference sequences" or "reference polynucleotides" as used herein in the context of 
differential gene expression analysis and diagnosis/prognosis refers to a selected set of 
polynucleotides, which selected set includes one or more of the differentially expressed
polynucleotides described herein. A plurality of reference sequences, preferably comprising
positive and negative control sequences, can be included as reference sequences. Additional suitable
reference sequences are found in GenBank, Unigene, and other nucleotide sequence databases
(including, e.g., expressed sequence tag (EST), partial, and full-length sequences).

"Reference array" means an array having reference sequences for use in hybridization with a 
sample, where the reference sequences include all, at least one of, or any subset of the differentially
expressed polynucleotides described herein. Usually such an array will include at least 3 different
reference sequences, and can include any one or all of the provided differentially expressed
sequences. Arrays of interest can further comprise sequences, including polymorphisms, of other
genetic sequences, particularly other sequences of interest for screening for a disease or disorder
(e.g., cancer, dysplasia, or other related or unrelated diseases, disorders, or conditions). The
oligonucleotide sequence on the array will usually be at least about 12 nt in length, and can be of
about the length of the provided sequences, or can extend into the flanking regions to generate
fragments of 100 nt to 200 nt in length or more. Reference arrays can be produced according to any
suitable methods known in the art. For example, methods of producing large arrays of
oligonucleotides are described in U.S. 5,134,854, and U.S. 5,445,934 using light-directed synthesis
techniques. Using a computer controlled system, a heterogeneous array of monomers is converted,
through simultaneous coupling at a number of reaction sites, into a heterogeneous array of polymers.
Alternatively, microarrays are generated by deposition of pre-synthesized oligonucleotides onto a
solid substrate, for example as described in PCT published application no. WO 95/35505.

A "reference expression pattern" or "REP" as used herein refers to the relative levels of
expression of a selected set of genes, particularly of differentially expressed genes, that is associated
with a selected cell type, e.g., a normal cell, a cancerous cell, a cell exposed to an environmental
stimulus, and the like. A "test expression pattern" or "TEP" refers to relative levels of expression of
a selected set of genes, particularly of differentially expressed genes, in a test sample (e.g., a cell of
unknown or suspected disease state, from which mRNA is isolated).

REPs can be generated in a variety of ways according to methods well known in the art. For
example, REPs can be generated by hybridizing a control sample to an array having a selected set of
polynucleotides (particularly a selected set of differentially expressed polynucleotides), acquiring
the hybridization data from the array, and storing the data in a format that allows for ready
comparison of the REP with a TEP. Alternatively, all expressed sequences in a control sample can
be isolated and sequenced, e.g., by isolating mRNA from a control sample, converting the mRNA
into cDNA, and sequencing the cDNA. The resulting sequence information roughly or precisely
reflects the identity and relative number of expressed sequences in the sample. The sequence
information can then be stored in a format (e.g., a computer-readable format) that allows for ready
comparison of the REP with a TEP. The REP can be normalized prior to or after data storage,
and/or can be processed to selectively remove sequences of expressed genes that are of less interest
or that might complicate analysis (e.g., some or all of the sequences associated with housekeeping
genes can be eliminated from REP data).
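
The processing of REP data described above (removing sequences of less interest, then normalizing) might be sketched as follows. The normalization used here, scaling the remaining signals so they sum to 1, is only one possible choice and is an assumption of this sketch, as are the function name, the gene labels, and the signal values.

```python
def build_expression_pattern(raw_signals, housekeeping=()):
    """Drop housekeeping genes, then scale the remaining signals to sum to 1.

    raw_signals maps a gene identifier to its measured signal.
    """
    kept = {gene: signal for gene, signal in raw_signals.items()
            if gene not in housekeeping}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no signal left to normalize")
    return {gene: signal / total for gene, signal in kept.items()}


# Hypothetical raw hybridization signals; ACTB stands in for a housekeeping gene.
raw = {"geneA": 30.0, "geneB": 10.0, "ACTB": 60.0}
rep = build_expression_pattern(raw, housekeeping={"ACTB"})
```

Storing the result as a plain mapping keeps it in a computer-readable form that can be compared directly with a TEP built the same way.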

TEPs can be generated in a manner similar to REPs, e.g., by hybridizing a test sample to an
array having a selected set of polynucleotides, particularly a selected set of differentially expressed
polynucleotides, acquiring the hybridization data from the array, and storing the data in a format that
allows for ready comparison of the TEP with a REP. The REP and TEP to be used in a comparison
can be generated simultaneously, or the TEP can be compared to previously generated and stored
REPs.

In one embodiment of the invention, comparison of a TEP with a REP involves hybridizing
a test sample with a reference array, where the reference array has one or more reference sequences
for use in hybridization with a sample. The reference sequences include all, at least one of, or any
subset of the differentially expressed polynucleotides described herein. Hybridization data for the
test sample is acquired, the data normalized, and the produced TEP compared with a REP generated
using an array having the same or similar selected set of differentially expressed polynucleotides.
Probes that correspond to sequences differentially expressed between the two samples will show 
decreased or increased hybridization efficiency for one of the samples relative to the other. 

Methods for collection of data from hybridization of samples with reference arrays are
well known in the art. For example, the polynucleotides of the reference and test samples can be
generated using a detectable fluorescent label, and hybridization of the polynucleotides in the
samples detected by scanning the microarrays for the presence of the detectable label using, for
example, a microscope and light source for directing light at a substrate. A photon counter detects
fluorescence from the substrate, while an x-y translation stage varies the location of the substrate. A
confocal detection device that can be used in the subject methods is described in USPN 5,631,734.
A scanning laser microscope is described in Shalon et al., Genome Res. (1996) 6:639. A scan, using
the appropriate excitation line, is performed for each fluorophore used. The digital images
generated from the scan are then combined for subsequent analysis. For any particular array
element, the fluorescent signal from one sample (e.g., a test sample) is compared to the
fluorescent signal from another sample (e.g., a reference sample), and the relative signal intensity
determined.
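
The per-element comparison of fluorescent signals can be illustrated with a short sketch. Expressing the relative signal intensity as a log2 ratio, and clamping very low signals to a floor value to avoid division by zero, are assumptions of this sketch rather than requirements of the method; the element names and intensities are hypothetical.

```python
import math


def log2_ratios(test_signal, reference_signal, floor=1.0):
    """Log2 ratio of test to reference signal for every array element.

    Signals below `floor` are raised to `floor` so the ratio is always defined.
    """
    ratios = {}
    for element, test in test_signal.items():
        ref = reference_signal[element]
        ratios[element] = math.log2(max(test, floor) / max(ref, floor))
    return ratios


# Hypothetical background-corrected intensities for three array elements.
test = {"spot1": 800.0, "spot2": 100.0, "spot3": 50.0}
ref = {"spot1": 200.0, "spot2": 100.0, "spot3": 200.0}
ratios = log2_ratios(test, ref)
```

A positive value indicates a stronger signal in the test sample, a negative value a stronger signal in the reference sample, and zero indicates equal signal.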

Methods for analyzing the data collected from hybridization to arrays are well known in the
art. For example, where detection of hybridization involves a fluorescent label, data analysis can
include the steps of determining fluorescent intensity as a function of substrate position from the
data collected, removing outliers, i.e., data deviating from a predetermined statistical distribution,
and calculating the relative binding affinity of the targets from the remaining data. The resulting
data can be displayed as an image with the intensity in each region varying according to the binding
affinity between targets and probes.
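
The outlier-removal step can be sketched as below. The text does not specify the statistical distribution, so this example assumes a simple rule that discards readings more than a chosen number of standard deviations from the mean; the rule, the threshold, and the replicate readings are all illustrative assumptions.

```python
import statistics


def remove_outliers(intensities, max_sd=2.0):
    """Keep readings within max_sd population standard deviations of the mean."""
    mean = statistics.mean(intensities)
    sd = statistics.pstdev(intensities)
    if sd == 0:
        return list(intensities)
    return [x for x in intensities if abs(x - mean) <= max_sd * sd]


# Hypothetical replicate readings for one probe; 50 is an aberrant value.
readings = [10, 11, 9, 10, 10, 50]
cleaned = remove_outliers(readings)
```

With only a few replicates a single extreme value inflates the standard deviation, so a median-based rule may be preferable in practice; the mean-based rule is used here only because it is the simplest to read.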

In general, the test sample is classified as having a gene expression profile corresponding to
that associated with a disease or non-disease state by comparing the TEP generated from the test
sample to one or more REPs generated from reference samples (e.g., from samples associated with
cancer or specific stages of cancer, dysplasia, samples affected by a disease other than cancer,
normal samples, etc.). The criteria for a match or a substantial match between a TEP and a REP
include expression of the same or substantially the same set of reference genes, as well as expression
of these reference genes at substantially the same levels (e.g., no significant difference between the
samples for a signal associated with a selected reference sequence after normalization of the
samples, or at least no greater than about 25% to about 40% difference in signal strength for a given
reference sequence). In general, a pattern match between a TEP and a REP includes a match in
expression, preferably a match in qualitative or quantitative expression level, of at least one of, all or
any subset of the differentially expressed genes of the invention.

Pattern matching can be performed manually, or can be performed using a computer
program. Methods for preparation of substrate matrices (e.g., arrays), design of oligonucleotides for
use with such matrices, labeling of probes, hybridization conditions, scanning of hybridized
matrices, and analysis of patterns generated, including comparison analysis, are described in, for
example, U.S. 5,800,992.
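
As an illustration of computer-based pattern matching, the following sketch applies the signal-difference criterion mentioned above. It assumes that both patterns are already normalized, that a match requires exactly the same set of reference genes, and that the difference is measured relative to the REP level; these choices, the function name, and the values are assumptions made for the example.

```python
def pattern_match(tep, rep, max_relative_difference=0.25):
    """True if every reference gene's TEP level is within the allowed
    relative difference of its REP level."""
    if set(tep) != set(rep):
        return False
    for gene, ref_level in rep.items():
        test_level = tep[gene]
        if ref_level == 0:
            # No reference signal: only an equally absent test signal matches.
            if test_level != 0:
                return False
            continue
        if abs(test_level - ref_level) / ref_level > max_relative_difference:
            return False
    return True


# Hypothetical normalized signal levels for two reference genes.
rep = {"geneA": 100.0, "geneB": 40.0}
tep_close = {"geneA": 110.0, "geneB": 36.0}
tep_mid = {"geneA": 130.0, "geneB": 40.0}
tep_far = {"geneA": 150.0, "geneB": 40.0}
```

Under these assumptions `tep_close` matches at the 25% threshold, `tep_mid` matches only when the threshold is relaxed to 40%, and `tep_far` matches at neither.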

Diagnosis, Prognosis and Management of Cancer

The polynucleotides of the invention and their gene products are of particular interest as
genetic or biochemical markers (e.g., in blood or tissues) that will detect the earliest changes along
the carcinogenesis pathway and/or monitor the efficacy of various therapies and preventive
interventions. For example, the level of expression of certain polynucleotides can be indicative of a
poorer prognosis, and therefore warrant more aggressive chemo- or radio-therapy for a patient or
vice versa. The correlation of novel surrogate tumor specific features with response to treatment and
outcome in patients can define prognostic indicators that allow the design of tailored therapy based
on the molecular profile of the tumor. These therapies include antibody targeting and gene therapy.
Determining expression of certain polynucleotides and comparison of a patient's profile with known
expression in normal tissue and variants of the disease allows a determination of the best possible
treatment for a patient, both in terms of specificity of treatment and in terms of comfort level of the
patient. Surrogate tumor markers, such as polynucleotide expression, can also be used to better
classify, and thus diagnose and treat, different forms and disease states of cancer. Two
classifications widely used in oncology that can benefit from identification of the expression levels
of the polynucleotides of the invention are staging of the cancerous disorder, and grading the nature
of the cancerous tissue.

The polynucleotides of the invention can be useful to monitor patients having or susceptible
to cancer to detect potentially malignant events at a molecular level before they are detectable at a
gross morphological level. Furthermore, a polynucleotide of the invention identified as important for
one type of cancer can also have implications for development or risk of development of other types
of cancer, e.g., where a polynucleotide is differentially expressed across various cancer types. Thus,
for example, expression of a polynucleotide that has clinical implications for metastatic colon cancer
can also have clinical implications for stomach cancer or endometrial cancer.

Staging. Staging is a process used by physicians to describe how advanced the cancerous
state is in a patient. Staging assists the physician in determining a prognosis, planning treatment and
evaluating the results of such treatment. Staging systems vary with the types of cancer, but generally
involve the following "TNM" system: the type of tumor, indicated by T; whether the cancer has
metastasized to nearby lymph nodes, indicated by N; and whether the cancer has metastasized to
more distant parts of the body, indicated by M. Generally, if a cancer is only detectable in the area
of the primary lesion without having spread to any lymph nodes it is called Stage I. If it has spread
only to the closest lymph nodes, it is called Stage II. In Stage III, the cancer has generally spread to
the lymph nodes in near proximity to the site of the primary lesion. Cancers that have spread to a
distant part of the body, such as the liver, bone, brain or other site, are Stage IV, the most advanced
stage.

The polynucleotides of the invention can facilitate fine-tuning of the staging process by
identifying markers for the aggressiveness of a cancer, e.g., the metastatic potential, as well as the
presence in different areas of the body. Thus, detection in a Stage II cancer of a polynucleotide
signifying high metastatic potential can be used to change a borderline Stage II tumor to a Stage III
tumor, justifying more aggressive therapy. Conversely, the presence of a polynucleotide signifying
a lower metastatic potential allows more conservative staging of a tumor.

Grading of cancers. Grade is a term used to describe how closely a tumor resembles normal
tissue of its same type. The microscopic appearance of a tumor is used to identify tumor grade based
on parameters such as cell morphology, cellular organization, and other markers of differentiation.
As a general rule, the grade of a tumor corresponds to its rate of growth or aggressiveness, with
undifferentiated or high-grade tumors being more aggressive than well differentiated or low-grade
tumors. The following guidelines are generally used for grading tumors: 1) GX Grade cannot be
assessed; 2) G1 Well differentiated; 3) G2 Moderately well differentiated; 4) G3 Poorly
differentiated; 5) G4 Undifferentiated. The polynucleotides of the invention can be especially
valuable in determining the grade of the tumor, as they not only can aid in determining the
differentiation status of the cells of a tumor, they can also identify factors other than differentiation
that are valuable in determining the aggressiveness of a tumor, such as metastatic potential.

Detection of lung cancer. The polynucleotides of the invention can be used to detect lung
cancer in a subject. Although there are more than a dozen different kinds of lung cancer, the two
main types of lung cancer are small cell and nonsmall cell, which encompass about 90% of all lung
cancer cases. Small cell carcinoma (also called oat cell carcinoma) usually starts in one of the larger
bronchial tubes, grows fairly rapidly, and is likely to be large by the time of diagnosis. Nonsmall
cell lung cancer (NSCLC) is made up of three general subtypes of lung cancer. Epidermoid
carcinoma (also called squamous cell carcinoma) usually starts in one of the larger bronchial tubes
and grows relatively slowly. The size of these tumors can range from very small to quite large.
Adenocarcinoma starts growing near the outside surface of the lung and can vary in both size and
growth rate. Some slowly growing adenocarcinomas are described as alveolar cell cancer. Large
cell carcinoma starts near the surface of the lung, grows rapidly, and the growth is usually fairly
large when diagnosed. Other less common forms of lung cancer are carcinoid, cylindroma,
mucoepidermoid, and malignant mesothelioma.

The polynucleotides of the invention, e.g., polynucleotides differentially expressed in
normal cells versus cancerous lung cells (e.g., tumor cells of high or low metastatic potential) or
between types of cancerous lung cells (e.g., high metastatic versus low metastatic), can be used to
distinguish types of lung cancer as well as to identify traits specific to a certain patient's cancer and
to select an appropriate therapy. For example, if the patient's biopsy expresses a polynucleotide
that is associated with a low metastatic potential, it may justify leaving a larger portion of the
patient's lung in surgery to remove the lesion. Alternatively, a smaller lesion with expression of a
polynucleotide that is associated with high metastatic potential may justify a more radical removal
of lung tissue and/or the surrounding lymph nodes, even if no metastasis can be identified through
pathological examination.

Detection of breast cancer. The majority of breast cancers are adenocarcinoma subtypes,
which can be summarized as follows: 1) ductal carcinoma in situ (DCIS), including
comedocarcinoma; 2) infiltrating (or invasive) ductal carcinoma (IDC); 3) lobular carcinoma in situ
(LCIS); 4) infiltrating (or invasive) lobular carcinoma (ILC); 5) inflammatory breast cancer; 6)
medullary carcinoma; 7) mucinous carcinoma; 8) Paget's disease of the nipple; 9) Phyllodes tumor;
and 10) tubular carcinoma.

The expression of polynucleotides of the invention can be used in the diagnosis and
management of breast cancer, as well as to distinguish between types of breast cancer. Detection of
breast cancer can be determined using expression levels of any of the appropriate polynucleotides of
the invention, either alone or in combination. The aggressive nature and/or the metastatic potential
of a breast cancer can also be determined by comparing levels of one or more polynucleotides of
the invention with levels of another sequence known to vary in cancerous tissue, e.g., ER
expression. In addition, development of breast cancer can be detected by
examining the ratio of expression of a differentially expressed polynucleotide to the levels of steroid
hormones (e.g., testosterone or estrogen) or to other hormones (e.g., growth hormone, insulin). Thus
expression of specific marker polynucleotides can be used to discriminate between normal and
cancerous breast tissue, to discriminate between breast cancers with different cells of origin, to
discriminate between breast cancers with different potential metastatic rates, etc.

Detection of colon cancer. The polynucleotides of the invention exhibiting the appropriate
expression pattern can be used to detect colon cancer in a subject. Colorectal cancer is one of the
most common neoplasms in humans and perhaps the most frequent form of hereditary neoplasia.
Prevention and early detection are key factors in controlling and curing colorectal cancer.
Colorectal cancer begins as polyps, which are small, benign growths of cells that form on the inner
lining of the colon. Over a period of several years, some of these polyps accumulate additional
mutations and become cancerous. Multiple familial colorectal cancer disorders have been identified,
which are summarized as follows: 1) Familial adenomatous polyposis (FAP); 2) Gardner's
syndrome; 3) Hereditary nonpolyposis colon cancer (HNPCC); and 4) Familial colorectal cancer in
Ashkenazi Jews. The expression of appropriate polynucleotides of the invention can be used in the
diagnosis, prognosis and management of colorectal cancer. Detection of colon cancer can be
determined using expression levels of any of these sequences alone or in combination with the levels
of expression. The aggressive nature and/or the metastatic potential of a colon
cancer can be determined by comparing levels of one or more polynucleotides of the invention with
total levels of another sequence known to vary in cancerous tissue, e.g., expression of
p53, DCC, ras, or FAP (see, e.g., Fearon ER, et al., Cell (1990) 61(5):759; Hamilton SR et al.,
Cancer (1993) 72:957; Bodmer W, et al., Nat Genet. (1994) 4(3):217; Fearon ER, Ann N Y Acad Sci.
(1995) 768:101). For example, development of colon cancer can be detected by examining the ratio
of any of the polynucleotides of the invention to the levels of oncogenes (e.g., ras) or tumor
suppressor genes (e.g., FAP or p53). Thus expression of specific marker polynucleotides can be used
to discriminate between normal and cancerous colon tissue, to discriminate between colon cancers
with different cells of origin, to discriminate between colon cancers with different potential
metastatic rates, etc.

Use of Polynucleotides to Screen for Peptide Analogs and Antagonists 
Polypeptides encoded by the instant polynucleotides and corresponding full length genes 
can be used to screen peptide libraries to identify binding partners, such as receptors, from among 
the encoded polypeptides. Peptide libraries can be synthesized according to methods known in the 
art (see, e.g., USPN 5,010,175, and WO 91/17823). Agonists or antagonists of the polypeptides of
the invention can be screened using any available method known in the art, such as signal
transduction, antibody binding, receptor binding, mitogenic assays, chemotaxis assays, etc. The
assay conditions ideally should resemble the conditions under which the native activity is exhibited
in vivo, that is, under physiologic pH, temperature, and ionic strength. Suitable agonists or
antagonists will exhibit strong inhibition or enhancement of the native activity at concentrations that 
do not cause toxic side effects in the subject. Agonists or antagonists that compete for binding to the 
native polypeptide can require concentrations equal to or greater than the native concentration, while 
inhibitors capable of binding irreversibly to the polypeptide can be added in concentrations on the 
order of the native concentration. 

Such screening and experimentation can lead to identification of a novel polypeptide
binding partner, such as a receptor, encoded by a gene or a cDNA corresponding to a polynucleotide
of the invention, and at least one peptide agonist or antagonist of the novel binding partner. Such
agonists and antagonists can be used to modulate, enhance, or inhibit receptor function in cells to
which the receptor is native, or in cells that possess the receptor as a result of genetic engineering.
Further, if the novel receptor shares biologically important characteristics with a known receptor,
information about agonist/antagonist binding can facilitate development of improved
agonists/antagonists of the known receptor.

Pharmaceutical Compositions and Therapeutic Uses
Pharmaceutical compositions of the invention can comprise polypeptides, antibodies, or
polynucleotides (including antisense nucleotides and ribozymes) of the claimed invention in a
therapeutically effective amount. The term "therapeutically effective amount" as used herein refers
to an amount of a therapeutic agent to treat, ameliorate, or prevent a desired disease or condition, or
to exhibit a detectable therapeutic or preventative effect. The effect can be detected by, for example,
chemical markers or antigen levels. Therapeutic effects also include reduction in physical
symptoms, such as decreased body temperature. The precise effective amount for a subject will
depend upon the subject's size and health, the nature and extent of the condition, and the therapeutics
or combination of therapeutics selected for administration. Thus, it is not useful to specify an exact
effective amount in advance. However, the effective amount for a given situation is determined by
routine experimentation and is within the judgment of the clinician. For purposes of the present
invention, an effective dose will generally be from about 0.01 mg/kg to 50 mg/kg or 0.05 mg/kg to
about 10 mg/kg of the DNA constructs in the individual to which it is administered.

A pharmaceutical composition can also contain a pharmaceutically acceptable carrier. The
term "pharmaceutically acceptable carrier" refers to a carrier for administration of a therapeutic
agent, such as antibodies or a polypeptide, genes, and other therapeutic agents. The term refers to
any pharmaceutical carrier that does not itself induce the production of antibodies harmful to the
individual receiving the composition, and which can be administered without undue toxicity.
Suitable carriers can be large, slowly metabolized macromolecules such as proteins,
polysaccharides, polylactic acids, polyglycolic acids, polymeric amino acids, amino acid
copolymers, and inactive virus particles. Such carriers are well known to those of ordinary skill in
the art. Pharmaceutically acceptable carriers in therapeutic compositions can include liquids such as
water, saline, glycerol and ethanol. Auxiliary substances, such as wetting or emulsifying agents, pH
buffering substances, and the like, can also be present in such vehicles. Typically, the therapeutic
compositions are prepared as injectables, either as liquid solutions or suspensions; solid forms
suitable for solution in, or suspension in, liquid vehicles prior to injection can also be prepared.
Liposomes are included within the definition of a pharmaceutically acceptable carrier.
Pharmaceutically acceptable salts can also be present in the pharmaceutical composition, e.g.,
mineral acid salts such as hydrochlorides, hydrobromides, phosphates, sulfates, and the like; and the
salts of organic acids such as acetates, propionates, malonates, benzoates, and the like. A thorough
discussion of pharmaceutically acceptable excipients is available in Remington's Pharmaceutical
Sciences (Mack Pub. Co., N.J. 1991).

Delivery Methods. Once formulated, the compositions of the invention can be
(1) administered directly to the subject (e.g., as polynucleotide or polypeptides); or (2) delivered ex
vivo, to cells derived from the subject (e.g., as in ex vivo gene therapy). Direct delivery of the
compositions will generally be accomplished by parenteral injection, e.g., subcutaneously,
intraperitoneally, intravenously or intramuscularly, intratumoral or to the interstitial space of a
tissue. Other modes of administration include oral and pulmonary administration, suppositories, and
transdermal applications, needles, and gene guns or hyposprays. Dosage treatment can be a single
dose schedule or a multiple dose schedule.

Methods for the ex vivo delivery and reimplantation of transformed cells into a subject are
known in the art and described in, e.g., International Publication No. WO 93/14778. Examples of
cells useful in ex vivo applications include, for example, stem cells, particularly hematopoietic,
lymph cells, macrophages, dendritic cells, or tumor cells. Generally, delivery of nucleic acids for
both ex vivo and in vitro applications can be accomplished by, for example, dextran-mediated
transfection, calcium phosphate precipitation, polybrene mediated transfection, protoplast fusion,
electroporation, encapsulation of the polynucleotide(s) in liposomes, and direct microinjection of the
DNA into nuclei, all well known in the art.

Once a gene corresponding to a polynucleotide of the invention has been found to correlate
with a proliferative disorder, such as neoplasia, dysplasia, and hyperplasia, the disorder can be
amenable to treatment by administration of a therapeutic agent based on the provided
polynucleotide, corresponding polypeptide or other corresponding molecule (e.g., antisense,
ribozyme, etc.).

The dose and the means of administration of the inventive pharmaceutical compositions are
determined based on the specific qualities of the therapeutic composition, the condition, age, and
weight of the patient, the progression of the disease, and other relevant factors. For example,
administration of polynucleotide therapeutic compositions of the invention includes local or
systemic administration, including injection, oral administration, particle gun or catheterized
administration, and topical administration. Preferably, the therapeutic polynucleotide composition
contains an expression construct comprising a promoter operably linked to a polynucleotide of at
least 12, 22, 25, 30, or 35 contiguous nt of the polynucleotide disclosed herein. Various methods can
be used to administer the therapeutic composition directly to a specific site in the body. For
example, a small metastatic lesion is located and the therapeutic composition injected several times
in several different locations within the body of the tumor. Alternatively, arteries which serve a tumor
are identified, and the therapeutic composition injected into such an artery, in order to deliver the
composition directly into the tumor. A tumor that has a necrotic center is aspirated and the
composition injected directly into the now empty center of the tumor. The antisense composition is
directly administered to the surface of the tumor, for example, by topical application of the
composition. X-ray imaging is used to assist in certain of the above delivery methods.

Receptor-mediated targeted delivery of therapeutic compositions containing an antisense
polynucleotide, subgenomic polynucleotides, or antibodies to specific tissues can also be used.
Receptor-mediated DNA delivery techniques are described in, for example, Findeis et al., Trends
Biotechnol. (1993) 11:202; Chiou et al., Gene Therapeutics: Methods And Applications Of Direct
Gene Transfer (J.A. Wolff, ed.) (1994); Wu et al., J. Biol. Chem. (1988) 263:621; Wu et al., J. Biol.
Chem. (1994) 269:542; Zenke et al., Proc. Natl. Acad. Sci. (USA) (1990) 87:3655; Wu et al., J. Biol.
Chem. (1991) 266:338. Therapeutic compositions containing a polynucleotide are administered in a
range of about 100 ng to about 200 mg of DNA for local administration in a gene therapy protocol.
Concentration ranges of about 500 ng to about 50 mg, about 1 μg to about 2 mg, about 5 μg to
about 500 μg, and about 20 μg to about 100 μg of DNA can also be used during a gene therapy
protocol. Factors such as method of action (e.g., for enhancing or inhibiting levels of the encoded
gene product) and efficacy of transformation and expression are considerations which will affect the
dosage required for ultimate efficacy of the antisense subgenomic polynucleotides. Where greater
expression is desired over a larger area of tissue, larger amounts of antisense subgenomic
polynucleotides or the same amounts readministered in a successive protocol of administrations, or
several administrations to different adjacent or close tissue portions of, for example, a tumor site,
may be required to effect a positive therapeutic outcome. In all cases, routine experimentation in
clinical trials will determine specific ranges for optimal therapeutic effect. For polynucleotide-related
genes encoding polypeptides or proteins with anti-inflammatory activity, suitable use, doses,
and administration are described in USPN 5,654,173.

The therapeutic polynucleotides and polypeptides of the present invention can be delivered
using gene delivery vehicles. The gene delivery vehicle can be of viral or non-viral origin (see
generally, Jolly, Cancer Gene Therapy (1994) 7:51; Kimura, Human Gene Therapy (1994) 5:845;
Connelly, Human Gene Therapy (1995) 7:185; and Kaplitt, Nature Genetics (1994) 6:148).
Expression of such coding sequences can be induced using endogenous mammalian or heterologous
promoters. Expression of the coding sequence can be either constitutive or regulated.

Viral-based vectors for delivery of a desired polynucleotide and expression in a desired cell
are well known in the art. Exemplary viral-based vehicles include, but are not limited to,
recombinant retroviruses (see, e.g., WO 90/07936; WO 94/03622; WO 93/25698; WO 93/25234;
USPN 5,219,740; WO 93/11230; WO 93/10218; USPN 4,777,127; GB Patent No. 2,200,651; EP 0
345 242; and WO 91/02805), alphavirus-based vectors (e.g., Sindbis virus vectors, Semliki forest
virus (ATCC VR-67; ATCC VR-1217), Ross River virus (ATCC VR-373; ATCC VR-1246) and
Venezuelan equine encephalitis virus (ATCC VR-923; ATCC VR-1250; ATCC VR 1249; ATCC
VR-532)), and adeno-associated virus (AAV) vectors (see, e.g., WO 94/12649; WO 93/03769; WO
93/19191; WO 94/28938; WO 95/11984 and WO 95/00655). Administration of DNA linked to
killed adenovirus as described in Curiel, Hum. Gene Ther. (1992) 5:147 can also be employed.

Non-viral delivery vehicles and methods can also be employed, including, but not limited to,
polycationic condensed DNA linked or unlinked to killed adenovirus alone (see, e.g., Curiel, Hum.
Gene Ther. (1992) 5:147); ligand-linked DNA (see, e.g., Wu, J. Biol. Chem. (1989) 267:16985);
eukaryotic cell delivery vehicles (see, e.g., USPN 5,814,482; WO 95/07994; WO 96/17072;
WO 95/30763; and WO 97/42338) and nucleic charge neutralization or fusion with cell membranes.
Naked DNA can also be employed. Exemplary naked DNA introduction methods are described in
WO 90/11092 and USPN 5,580,859. Liposomes that can act as gene delivery vehicles are described
in USPN 5,422,120; WO 95/13796; WO 94/23697; WO 91/14445; and EP 0524968. Additional
approaches are described in Philip, Mol. Cell Biol. (1994) 14:2411, and in Woffendin, Proc. Natl.
Acad. Sci. (1994) 91:1581.

Further non-viral delivery suitable for use includes mechanical delivery systems such as the
approach described in Woffendin et al., Proc. Natl. Acad. Sci. USA (1994) 91(24):11581. Moreover,
the coding sequence and the product of expression of such can be delivered through deposition of
photopolymerized hydrogel materials or use of ionizing radiation (see, e.g., USPN 5,206,152 and
WO 92/11033). Other conventional methods for gene delivery that can be used for delivery of the
coding sequence include, for example, use of hand-held gene transfer particle gun (see, e.g., USPN
5,149,655); use of ionizing radiation for activating transferred gene (see, e.g., USPN 5,206,152 and
WO 92/11033).

The present invention will now be illustrated by reference to the following examples which 
set forth particularly advantageous embodiments. However, it should be noted that these 
embodiments are illustrative and are not to be construed as restricting the invention in any way. 

EXAMPLES 

Example 1: Source of Biological Materials and Overview of Novel Polynucleotides Expressed
by the Biological Materials

cDNA libraries were constructed from mRNA isolated from cells of one of the following cell lines:
human colon cancer cell line KM12L4-A (Morikawa et al., Cancer Research (1988) 48:6863),
KM12C (Morikawa et al., Cancer Res. (1988) 48:1943-1948), or MDA-MB-231 (Brinkley et al.,
Cancer Res. (1980) 40:3118-3129). Sequences expressed by these cell
lines were isolated and analyzed; most sequences were about 275-300 nucleotides in length. The
KM12L4-A cell line is derived from the KM12C cell line. The KM12C cell line, which is poorly
metastatic (low metastatic), was established in culture from a Dukes' stage B2 surgical specimen
(Morikawa et al., Cancer Res. (1988) 48:6863). The KM12L4-A is a highly metastatic subline derived
from KM12C (Yeatman et al., Nucl. Acids. Res. (1995) 23:4007; Bao-Ling et al., Proc. Annu. Meet.
Am. Assoc. Cancer. Res. (1995) 21:3269). The KM12C and KM12C-derived cell lines (e.g.,
KM12L4, KM12L4-A, etc.) are well-recognized in the art as a model cell line for the study of colon
cancer (see, e.g., Morikawa et al., supra; Radinsky et al., Clin. Cancer Res. (1995) 7:19; Yeatman
et al., (1995) supra; Yeatman et al., Clin. Exp. Metastasis (1996) 77:246). The MDA-MB-231 cell
line was originally isolated from pleural effusions (Cailleau, J. Natl. Cancer. Inst. (1974) 53:661), is
of high metastatic potential, and forms poorly differentiated adenocarcinoma grade II in nude mice
consistent with breast carcinoma.

The sequences of the isolated polynucleotides were first masked to eliminate low
complexity sequences using the XBLAST masking program (Claverie "Effective Large-Scale
Sequence Similarity Searches," In: Computer Methods for Macromolecular Sequence Analysis,
Doolittle, ed., Meth. Enzymol. 266:212-227 Academic Press, NY, NY (1996); see particularly
Claverie, in "Automated DNA Sequencing and Analysis Techniques" Adams et al., eds., Chap. 36,
p. 267 Academic Press, San Diego, 1994 and Claverie et al., Comput. Chem. (1993) 12:191).
Generally, masking does not influence the final search results, except to eliminate sequences of
relatively little interest due to their low complexity, and to eliminate multiple "hits" based on
similarity to repetitive regions common to multiple sequences, e.g., Alu repeats. Masking resulted
in the elimination of 43 sequences. The remaining sequences were then used in a BLASTN vs.
GenBank search; sequences that exhibited greater than 70% overlap, 99% identity, and a p value of
less than 1 x 10^-40 were discarded. Sequences from this search also were discarded if the inclusive
parameters were met, but the sequence was ribosomal or vector-derived.
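
The discard rule applied to the BLASTN vs. GenBank results can be sketched as a simple predicate.
The Python fragment below is a minimal illustration only: the function name, argument names, and
the representation of a hit as three numbers plus a flag are assumptions made for this sketch, and
the rule for ribosomal or vector-derived sequences reflects one reading of the text (such sequences
are discarded regardless of the numeric thresholds).

```python
def is_discarded(overlap_pct, identity_pct, p_value, ribosomal_or_vector=False):
    """Return True if a BLASTN-vs-GenBank hit causes the query to be discarded.

    Thresholds follow the text: greater than 70% overlap, greater than 99%
    identity, and a p value below 1e-40 mark the query as an already-known
    sequence. Treating ribosomal or vector-derived hits as an unconditional
    discard is an assumption of this sketch.
    """
    known_sequence = (
        overlap_pct > 70.0
        and identity_pct > 99.0
        and p_value < 1e-40
    )
    return known_sequence or ribosomal_or_vector


if __name__ == "__main__":
    # Hypothetical hits, for illustration only.
    print(is_discarded(95.0, 99.5, 1e-50))        # strong match to a known entry
    print(is_discarded(95.0, 99.5, 1e-10))        # p value not small enough
    print(is_discarded(10.0, 50.0, 1.0, True))    # vector-derived sequence
```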
The resulting sequences from the previous search were classified into three groups (1, 2 and 3
below) and searched in a BLASTX vs. NRP (non-redundant proteins) database search: (1) unknown
(no hits in the GenBank search), (2) weak similarity (greater than 45% identity and p value of less
than 1 x 10^-5), and (3) high similarity (greater than 60% overlap, greater than 80% identity, and p
value less than 1 x 10^-5). Sequences having greater than 70% overlap, greater than 99% identity, and
p value of less than 1 x 10^-40 were discarded.
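
The three-way grouping described above can be expressed as a short decision function. The sketch
below is illustrative only. It assumes each query has at most one best hit, given as a tuple of
(overlap %, identity %, p value); the names classify, HIGH_P and WEAK_P are hypothetical; the p-value
cutoff for the high-similarity group is taken here as 1e-5, which is an assumption; and the
"unclassified" label for hits meeting neither set of thresholds is not defined in the text.

```python
WEAK_P = 1e-5   # p-value cutoff for the weak-similarity group (from the text)
HIGH_P = 1e-5   # p-value cutoff for the high-similarity group (assumed value)


def classify(best_hit=None):
    """Assign a query to 'unknown', 'weak similarity' or 'high similarity'.

    best_hit is None when the search returned no hits, otherwise a tuple
    (overlap_pct, identity_pct, p_value). Queries already discarded by the
    70% / 99% / 1e-40 rule are assumed to have been removed beforehand.
    """
    if best_hit is None:
        return "unknown"
    overlap_pct, identity_pct, p_value = best_hit
    # Test the stricter group first so strong hits are not labelled weak.
    if overlap_pct > 60.0 and identity_pct > 80.0 and p_value < HIGH_P:
        return "high similarity"
    if identity_pct > 45.0 and p_value < WEAK_P:
        return "weak similarity"
    return "unclassified"  # fallback label, an assumption of this sketch


if __name__ == "__main__":
    for hit in (None, (75.0, 90.0, 1e-20), (30.0, 50.0, 1e-8)):
        print(hit, "->", classify(hit))
```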

The remaining sequences were classified as unknown (no hits), weak similarity, and high
similarity (parameters as above). Two searches were performed on these sequences. First, a
BLAST vs. EST database search was performed and sequences with greater than 99% overlap,
greater than 99% similarity and a p value of less than 1 x 10^-40 were discarded. Sequences with a p
value of less than 1 x 10^-65 when compared to a database sequence of human origin were also
excluded. Second, a BLASTN vs. Patent GeneSeq database was performed and sequences having
greater than 99% identity, p value less than 1 x 10^-40, and greater than 99% overlap were discarded.

The remaining sequences were subjected to screening using other rules and redundancies in
the dataset. Sequences with a p value of less than 1 x 10^-111 in relation to a database sequence of
human origin were specifically excluded. The final result provided the 1,565 sequences listed as
SEQ ID NOS:1-1565 in the accompanying Sequence Listing and summarized in Table 1A (inserted
prior to claims). Each identified polynucleotide represents sequence from at least a partial mRNA
transcript.

Table 1A provides: 1) the SEQ ID NO assigned to each sequence for use in the present
specification; 2) the filing date of the U.S. priority application in which the sequence was first filed;
3) the attorney docket number assigned to the priority application (for internal use); 4) the SEQ ID
NO assigned to the sequence in the priority application; 5) the sequence name used as an internal
identifier of the sequence; and 6) the name assigned to the clone from which the sequence was
isolated. Because the provided polynucleotides represent partial mRNA transcripts, two or more
polynucleotides of the invention may represent different regions of the same mRNA transcript and
the same gene. Thus, if two or more SEQ ID NOS are identified as belonging to the same clone,
then any of those sequences can be used to obtain the full-length mRNA or gene.

In order to confirm the sequences of SEQ ID NOS:1-1565, the clones were retrieved from a
library using a robotic retrieval system, and the inserts of the retrieved clones re-sequenced. These
"validation" sequences are provided as SEQ ID NOS:1566-2610 in the Sequence Listing, and a
summary of the "validation" sequences is provided in Table 1B (inserted prior to claims). Table 1B
provides: 1) the SEQ ID NO assigned to each sequence for use in the present specification; 2) the
sequence name assigned to the "validation" sequence obtained; 3) whether the "validation"
sequence contains sequence that overlaps with an original sequence of SEQ ID NOS:1-1565
(Validation Overlap (VO)), or whether the "validation" sequence does not substantially overlap with
an original sequence of SEQ ID NOS:1-1565 (indicated by Validation Non-Overlap (VNO)); and
4) where the sequence is indicated as VO, the name of the clone that contains the indicated
"validation" sequence. "Validation" sequences are indicated as "VO" where the "validation"
sequence overlaps with an original sequence (e.g., one of SEQ ID NOS:1-1565), and/or the
"validation" sequence belongs to the same cluster as the original sequence using the clustering
technique described above. Because the inserts of the clones are generally longer than the original
sequence and the validation sequence, it is possible that a "validation" sequence can be obtained
from the same clone as an original sequence but yet not share any of the sequence of the original.
Such validation sequences will, however, belong to the same cluster as the original sequence using
the clustering technique described above. VO "validation" sequences are contained within the same
clone as the original sequence (one of SEQ ID NOS:1-1565). "Validation" sequences that provided
overlapping sequence, indicated by "VO", can be correlated with the original sequences they
validate by referring to Table 1A. Sequences indicated as VNO are treated as newly isolated
sequences and may or may not be related to the sequences of SEQ ID NOS:1-1565. The
"validation" sequences are often longer than the original polynucleotide sequences and thus provide
additional sequence information. All validation sequences can be obtained either from an indicated
clone (e.g., for VO sequences) or from a cDNA library described herein (e.g., using primers
designed from the sequence provided in the sequence listing).

Example 2: Results of Public Database Search to Identify Function of Gene Products
SEQ ID NOS:1566-2610 were translated in all three reading frames, and the nucleotide
sequences and translated amino acid sequences used as query sequences to search for homologous
sequences in either the GenBank (nucleotide sequences) or Non-Redundant Protein (amino acid
sequences) databases. Query and individual sequences were aligned using the BLAST 2.0 programs,
available over the world wide web at http://www.ncbi.nlm.nih.gov/BLAST/ (see also Altschul, et al.
Nucleic Acids Res. (1997) 25:3389-3402). The sequences were masked to various extents to prevent
searching of repetitive sequences or poly-A sequences, using the XBLAST program for masking low
complexity as described above in Example 1.
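
As an illustration of what translation "in all three reading frames" entails, the following Python
sketch translates a nucleotide sequence starting at offsets 0, 1 and 2 of the given strand using the
standard genetic code. It is a minimal sketch and not the procedure actually used to prepare the
queries; the function name translate_three_frames is hypothetical, incomplete trailing codons are
dropped, and codons containing characters other than A, C, G or T (for example masked positions)
are rendered as "X" by assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def translate_three_frames(nucleotides):
    """Translate a DNA sequence in the three forward reading frames.

    Returns a list of three protein strings, one per frame offset (0, 1, 2).
    """
    seq = nucleotides.upper().replace("U", "T")
    frames = []
    for offset in range(3):
        protein = []
        # Step through complete codons only; a trailing partial codon is ignored.
        for i in range(offset, len(seq) - 2, 3):
            protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
        frames.append("".join(protein))
    return frames


if __name__ == "__main__":
    print(translate_three_frames("ATGGCCTAA"))
```

Each translated frame can then serve as a protein query, while the untranslated nucleotide sequence
serves as the nucleotide query.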

Tables 2A and 2B (inserted before the claims) provide the alignment summaries having a p
value of 1 x 10^-2 or less, indicating substantial homology between the sequences of the present
invention and those of the indicated public databases. Table 2A provides the SEQ ID NO of the
query sequence, the accession number of the GenBank database entry of the homologous sequence,
and the p value of the alignment. Table 2B provides the SEQ ID NO of the query sequence, the
accession number of the Non-Redundant Protein database entry of the homologous sequence, and
the p value of the alignment. The alignments provided in Tables 2A and 2B are the best available
alignment to a DNA or amino acid sequence at a time just prior to filing of the present specification.
The activity of the polypeptide encoded by the SEQ ID NOS listed in Tables 2A and 2B can be
extrapolated to be substantially the same or substantially similar to the activity of the reported
nearest neighbor or closely related sequence. The accession number of the nearest neighbor is
reported, providing a publicly available reference to the activities and functions exhibited by the
nearest neighbor. The public information regarding the activities and functions of each of the
nearest neighbor sequences is incorporated by reference in this application. Also incorporated by
reference is all publicly available information regarding the sequence, as well as the putative and
actual activities and functions of the nearest neighbor sequences listed in Table 2 and their related
sequences. The search program and database used for the alignment, as well as the calculation of
the p value, are also indicated.

Full length sequences or fragments of the polynucleotide sequences of the nearest neighbors
can be used as probes and primers to identify and isolate the full length sequence of the
corresponding polynucleotide. The nearest neighbors can indicate a tissue or cell type to be used to
construct a library for the full-length sequences of the corresponding polynucleotides.



Example 3: Members of Protein Families 

SEQ ID NOS:1566-2601 were used to conduct a profile search as described in the
specification above. Several of the polynucleotides of the invention were found to encode
polypeptides having characteristics of a polypeptide belonging to a known protein family (and thus
represent new members of these protein families) and/or comprising a known functional domain
(Table 3A, inserted prior to claims). Table 3A provides the SEQ ID NO of the query sequence, a
brief description of the profile hit, the position of the query sequence within the individual sequence
(indicated as "start" and "stop"), and the orientation (Direction) of the query sequence with respect
to the individual sequence, where forward (for) indicates that the alignment is in the same direction
(left to right) as the sequence provided in the Sequence Listing and reverse (rev) indicates that the
alignment is with a sequence complementary to the sequence provided in the Sequence Listing.
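
The meaning of the "for" and "rev" orientations can be illustrated with a short Python sketch that
returns the strand on which an alignment was found. This is a minimal illustration under stated
assumptions: the function name oriented_sequence is hypothetical, the direction is assumed to be
recorded as the literal strings "for" and "rev", and characters other than A, C, G and T (such as N)
are left unchanged.

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def oriented_sequence(sequence, direction):
    """Return the strand corresponding to a profile hit's orientation.

    'for' returns the sequence as listed; 'rev' returns its reverse
    complement, i.e. the complementary strand read left to right.
    """
    if direction == "for":
        return sequence
    if direction == "rev":
        return sequence.translate(_COMPLEMENT)[::-1]
    raise ValueError("direction must be 'for' or 'rev'")


if __name__ == "__main__":
    print(oriented_sequence("ATGC", "for"))
    print(oriented_sequence("ATGC", "rev"))
```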

Some polynucleotides exhibited multiple profile hits where the query sequence contains
overlapping profile regions, and/or where the sequence contains two different functional domains.
Each of the profile hits of Table 3A is described in more detail below. The acronyms for the
profiles (provided in parentheses) are those used to identify the profile in the Pfam and Prosite
databases. The Pfam database can be accessed through any of the following URLs:
http://pfam.wustl.edu/index.html; http://www.sanger.ac.uk/Software/Pfam/; and
http://www.cgr.ki.se/Pfam/. The Prosite database can be accessed at http://www.expasy.ch/prosite/.
The public information available on the Pfam and Prosite databases regarding the various profiles,
including but not limited to the activities, function, and consensus sequences of various proteins
families and protein domains, is incorporated herein by reference.

14-3-3 Family (14 3 3). SEQ ID NO:1967 corresponds to a sequence encoding a 14-3-3
protein family member. The 14-3-3 protein family includes a group of closely related acidic
homodimeric proteins of about 30 kD first identified as very abundant in mammalian brain tissues
and located preferentially in neurons (Aitken et al., Trends Biochem. Sci. (1995) 20:95-97; Morrison
Science (1994) 266:56-57; and Xiao et al., Nature (1995) 376:188-191). The 14-3-3 proteins have
multiple biological activities, including a key role in signal transduction pathways and the cell cycle.
14-3-3 proteins interact with kinases (e.g., PKC or Raf-1), and can also function as protein-kinase
dependent activators of tyrosine and tryptophan hydroxylases. The 14-3-3 protein sequences are
extremely well conserved, and include two highly conserved regions: the first is a peptide of 11
residues located in the N-terminal section; the second, a 20 amino acid region located in the
C-terminal section. The consensus patterns are as follows: 1) R-N-L-[LIV]-S-[VG]-[GA]-Y-[KN]-N-
[IVA]; 2) Y-K-[DE]-S-T-L-I-[IM]-Q-L-[LF]-[RHC]-D-N-[LF]-T-[LS]-W-[TAN]-[SAD].

3'5'-cyclic Nucleotide Phosphodiesterases (PDEase). SEQ ID NO:2366 represents a
polynucleotide encoding a novel 3'5'-cyclic nucleotide phosphodiesterase. PDEases catalyze the
hydrolysis of cAMP or cGMP to the corresponding nucleoside 5' monophosphates (Charbonneau et
al., Proc. Natl. Acad. Sci. U.S.A. (1986) 83:9308). There are at least seven different subfamilies of
PDEases (Beavo et al., Trends Pharmacol. Sci. (1990) 11:150; http://weber.u.washington.edu/~pde/):
1) Type 1, calmodulin/calcium-dependent PDEases; 2) Type 2, cGMP-stimulated PDEases; 3) Type
3, cGMP-inhibited PDEases; 4) Type 4, cAMP-specific PDEases; 5) Type 5, cGMP-specific
PDEases; 6) Type 6, rhodopsin-sensitive cGMP-specific PDEases; and 7) Type 7, high affinity
cAMP-specific PDEases. All PDEase forms share a conserved domain of about 270 residues. The
signature pattern is determined from a stretch of 12 residues that contains two conserved histidines:
H-D-[LIVMFY]-x-H-x-[AG]-x(2)-[NQ]-x-[LIVMFY].
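
Consensus patterns of the kind listed in this Example follow the Prosite pattern notation, in which
elements are separated by hyphens, "x" stands for any residue, square brackets list permitted
residues, curly braces list excluded residues, and a number or range in parentheses gives a repeat
count. The Python sketch below shows one way such a pattern might be converted into a regular
expression and used to scan a protein sequence. It is illustrative only: the function names are
hypothetical, only the notation elements just described are handled, and the peptide in the usage
example is invented for the demonstration and is not a sequence of the invention.

```python
import re

# One pattern element: [ABC], {ABC}, a single residue letter, or x,
# optionally followed by a repeat count such as (2) or (3,4).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a hyphen-separated Prosite-style pattern to a regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            regex = "."                        # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # excluded residues
        else:
            regex = core                       # literal residue or [set]
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


def find_pattern(pattern, protein):
    """Return 0-based start positions of non-overlapping pattern matches."""
    return [m.start() for m in re.finditer(prosite_to_regex(pattern), protein)]


if __name__ == "__main__":
    pdease = "H-D-[LIVMFY]-x-H-x-[AG]-x(2)-[NQ]-x-[LIVMFY]"
    print(prosite_to_regex(pdease))
    print(find_pattern(pdease, "MKTHDLAHGAKKNALGG"))  # invented test peptide
```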

Four Transmembrane Integral Membrane Proteins (transmembrane4). SEQ ID NOS:1579
and 1978 correspond to a sequence encoding a member of the four transmembrane
segments integral membrane protein family (tm4 family). The tm4 family of proteins includes a
number of evolutionarily-related eukaryotic cell surface antigens (Levy et al., J. Biol. Chem. (1991)
266:14597; Tomlinson et al., Eur. J. Immunol. (1993) 23:136; Barclay et al., The leucocyte antigen
factbooks (1993) Academic Press, London/San Diego). The tm4 family members are type III
membrane proteins, which are integral membrane proteins containing an N-terminal membrane-
anchoring domain that functions both as a translocation signal and as a membrane anchor. The
family members also contain three additional transmembrane regions, at least seven conserved
cysteine residues, and are of approximately the same size (218 to 284 residues). The consensus
pattern spans a conserved region including two cysteines located in a short cytoplasmic loop
between two transmembrane domains: G-x(3)-[LIVMF]-x(2)-[GSA]-
[LIVMF](2)-G-C-x-[GA]-[STA]-x(2)-[EG]-x(2)-[CWN]-[LIVM](2).

Seven Transmembrane Integral Membrane Proteins - Rhodopsin Family (7tm 1). SEQ ID
NOS:1652, 1927, and 2068 correspond to a sequence encoding a member of the seven
transmembrane (7tm) receptor rhodopsin family. G-protein coupled receptors of the (7tm)
rhodopsin family include hormone, neurotransmitter, and light receptors that transduce
extracellular signals by interaction with guanine nucleotide-binding (G) proteins (Strosberg, Eur. J.
Biochem. (1991) 196:1; Kerlavage, Curr. Opin. Struct. Biol. (1991) 7:394; Probst et al., DNA Cell
Biol. (1992); Savarese et al., Biochem. J. (1992) 283:1; http://www.gcrdb.uthscsa.edu/;
http://swift.embl-heidelberg.de/7tm/). The consensus pattern that contains the conserved triplet and
that also spans the major part of the third transmembrane helix is used to detect this widespread
family of proteins: [GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-x(2)-
[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-x(2)-[LIVM].

Seven Transmembrane Integral Membrane Proteins - Secretin Family (7tm 2). SEQ ID
NOS:1598, 1719, 1911, 1927, 2068, and 2341 correspond to a sequence encoding a member of the
seven transmembrane receptor (7tm) secretin family (Jueppner et al., Science (1991) 254:1024;
Hamann et al., Genomics (1996) 32:144). The N-terminal extracellular domain of these receptors
contains five conserved cysteine residues involved in disulfide bonds, with a consensus pattern in
the region that spans the first three cysteines. One of the most highly conserved regions spans the
C-terminal part of the last transmembrane region and the beginning of the adjacent intracellular region
and is used as a second signature pattern. The two consensus patterns are: 1) C-x(3)-[FYWLIV]-D-
x(3,4)-C-[FW]-x(2)-[STAGV]-x(8,9)-C-[PF]; and 2) Q-G-[LMFCA]-[LIVMFT]-[LIV]-x-
[LIVFST]-[LIF]-[VFYH]-C-[LFY]-x-N-x(2)-V.

ATPases Associated with Various Cellular Activities (ATPases). Several of the
polynucleotides of the invention correspond to a sequence that encodes a member of a family of
ATPases Associated with diverse cellular Activities (AAA). The AAA protein family is composed
of a large number of ATPases that share a conserved region of about 220 amino acids containing an
ATP-binding site (Froehlich et al., J. Cell Biol. (1991) 114:443; Erdmann et al., Cell (1991) 67:499;
Peters et al., EMBO J. (1990) 9:1757; Kunau et al., Biochimie (1993) 75:209-224; Confalonieri et
al., BioEssays (1995) 17:639; http://yeamob.pci.chemie.uni-tuebingen.de/AAA/Description.html).
The AAA domain, which can be present in one or two copies, acts as an ATP-dependent protein
clamp (Confalonieri et al. (1995) BioEssays 17:639) and contains a highly conserved region located
in the central part of the domain. The consensus pattern is: [LIVMT]-x-[LIVMT]-[LIVMF]-x-
[GATMC]-[ST]-[NS]-x(4)-[LIVM]-D-x-A-[LIFA]-x-R.

Basic Region Plus Leucine Zipper Transcription Factors (BZIP). SEQ ID NO:1623
represents a polynucleotide encoding a novel member of the family of basic region plus leucine
zipper transcription factors. The bZIP superfamily (Hurst, Protein Prof. (1995) 2:105; and
Ellenberger, Curr. Opin. Struct. Biol. (1994) 7:12) of eukaryotic DNA-binding transcription factors
encompasses proteins that contain a basic region mediating sequence-specific DNA-binding
followed by a leucine zipper required for dimerization. The consensus pattern for this protein family
is: [KR]-x(1,3)-[RKSAQ]-N-x(2)-[SAQ](2)-x-[RKTAENQ]-x-R-x-[RK].

C2 domain (C2). SEQ ID NOS:1715 and 2426 correspond to a sequence encoding a C2
domain, which is involved in calcium-dependent phospholipid binding (Davletov, J. Biol. Chem.
(1993) 268:26386-26390) or, in proteins that do not bind calcium, the domain may facilitate binding
to inositol-1,3,4,5-tetraphosphate (Fukuda et al., J. Biol. Chem. (1994) 269:29206-29211; Sutton et
al., Cell (1995) 80:929-938). The consensus sequence is: [ACG]-x(2)-L-x(2,3)-D-x(1,2)-
[NGSTLIF]-[GTMR]-x-[STAP]-D-[PA]-[FY].

Cysteine proteases (Cys-protease). SEQ ID NO:2238 represents a polynucleotide encoding
a protein having a eukaryotic thiol (cysteine) protease active site. Cysteine proteases (Dufour,
Biochimie (1988) 70:1335) are a family of proteolytic enzymes that contain an active site cysteine.
Catalysis proceeds through a thioester intermediate and is facilitated by a nearby histidine side
chain; an asparagine completes the essential catalytic triad. The sequences around the three active
site residues are well conserved and can be used as signature patterns: 1) Q-x(3)-[GE]-x-C-[YW]-x(2)-
[STAGC]-[STAGCV] (where C is the active site residue); 2) [LIVMGSTAN]-x-H-[GSACE]-
[LIVM]-x-[LIVMAT](2)-G-x-[GSADNH] (where H is the active site residue); and 3) [FYCH]-
[WI]-[LIVT]-x-[KRQAG]-N-[ST]-W-x(3)-[FYW]-G-x(2)-G-[LFYW]-[LIVMFYG]-x-[LIVMF]
(where N is the active site residue).

DEAD and DEAH box families ATP-dependent helicases (Dead box helic). SEQ ID
NOS:1630, 1865, and 2517 represent polynucleotides encoding a novel member of the DEAD and
DEAH box families (Schmid et al., Mol. Microbiol. (1992) 6:283; Linder et al., Nature (1989)
337:121; Wassarman et al., Nature (1991) 379:463). All members of these families are involved in
ATP-dependent, nucleic-acid unwinding. All DEAD box family members share a number of
conserved sequence motifs, some of which are specific to the DEAD family, with others shared by
other ATP-binding proteins or by proteins belonging to the helicases "superfamily" (Hodgman,
Nature (1988) 333:22 and Nature (1988) 353:578 (Errata); http://www.expasy.ch/www/linder/
HELICASES_TEXT.html). One of these motifs, called the 'D-E-A-D-box', represents a special
version of the B motif of ATP-binding proteins. Proteins that have His instead of the second Asp
are 'D-E-A-H-box' proteins (Wassarman et al., Nature (1991) 379:463; Marosh et al., Nucleic
Acids Res. (1991) 79:6331; Koonin et al., J. Gen. Virol. (1992) 73:989; http://www.expasy.ch/
www/linder/HELICASES_TEXT.html). The following signature patterns are used to identify
members of both subfamilies: 1) [LIVMF](2)-D-E-A-D-[RKEN]-x-[LIVMFYGSTN]; and 2)
[GSAH]-x-[LIVMF](3)-D-E-[ALIV]-H-[NECR].

Dual specificity phosphatase (DSPc). Dual specificity phosphatases (DSPs) are Ser/Thr and
Tyr protein phosphatases that comprise a tertiary fold highly similar to that of tyrosine-specific
phosphatases, except for a "recognition" region connecting helix alpha1 to strand beta1. This
tertiary fold may determine differences in substrate specificity between VH-1 related dual specificity
phosphatase (VHR), the protein tyrosine phosphatases (PTPs), and other DSPs. Phosphatases are
important in the control of cell growth, proliferation, differentiation and transformation.

EF Hand (EFhand). SEQ ID NO:1595 corresponds to a polynucleotide encoding a member
of the EF-hand protein family, a calcium binding domain shared by many calcium-binding proteins
belonging to the same evolutionary family (Kawasaki et al., Protein Prof. (1995) 2:305-490). The
domain is a twelve residue loop flanked on both sides by a twelve residue alpha-helical domain, with
a calcium ion coordinated in a pentagonal bipyramidal configuration. The six residues involved in
the binding are in positions 1, 3, 5, 7, 9 and 12; these residues are denoted by X, Y, Z, -Y, -X and -Z.
The invariant Glu or Asp at position 12 provides two oxygens for liganding Ca (bidentate ligand).
The consensus pattern includes the complete EF-hand loop as well as the first residue which follows
the loop and which seems to always be hydrophobic: D-x-[DNS]-{ILVFYW}-[DENSTG]-
[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW].

Eukaryotic Aspartyl Proteases (asp). Several of the polynucleotides of the invention
correspond to a sequence encoding a novel eukaryotic aspartyl protease. Aspartyl proteases, known
as acid proteases (EC 3.4.23.-), are a widely distributed family of proteolytic enzymes (Foltmann,
Essays Biochem. (1981) 17:52; Davies, Annu. Rev. Biophys. Chem. (1990) 19:189; Rao et al.,
Biochemistry (1991) 30:4663) known to exist in vertebrates, fungi, plants, retroviruses and some
plant viruses. Aspartate proteases of eukaryotes are monomeric enzymes which consist of two
domains. Each domain contains an active site centered on a catalytic aspartyl residue. The
consensus pattern to identify eukaryotic aspartyl protease is: [LIVMFGAC]-[LIVMTADN]-
[LIVFSA]-D-[ST]-G-[STAV]-[STAPDENQ]-x-[LIVMFSTNC]-x-[LIVMFGTA], where D is the
active site residue.

Fibronectin Type II collagen-binding domain (FntypeII). SEQ ID NO:1968 corresponds to
a polynucleotide encoding a polypeptide having a type II fibronectin collagen binding domain.
Fibronectin is a plasma protein that binds cell surfaces and various compounds including collagen,
fibrin, heparin, DNA, and actin. The major part of the sequence of fibronectin consists of the
repetition of three types of domains, called type I, II, and III (Skorstengaard et al., Eur. J. Biochem.
(1986) 161:441). The type II domain, which is duplicated in fibronectin, is approximately forty
residues long, contains four conserved cysteines involved in disulfide bonds and is part of the
collagen-binding region of fibronectin. The consensus pattern for identifying members of this
family, which pattern spans this entire domain, is: C-x(2)-P-F-x-[FYWI]-x(7)-C-x(8,10)-W-C-x(4)-
[DNSR]-[FYW]-x(3,5)-[FYW]-x-[FYWI]-C (where the four C's are involved in disulfide bonds).

G-Protein Alpha Subunit (G-alpha). SEQ ID NO:1779 corresponds to a gene encoding a
member of the G-protein alpha subunit family. G-proteins are a family of membrane-associated
proteins that couple extracellularly-activated integral-membrane receptors to intracellular effectors,
such as ion channels and enzymes that vary the concentration of second messenger molecules.
G-proteins are composed of 3 subunits (alpha, beta and gamma) which, in the resting state, associate as
a trimer at the inner face of the plasma membrane. The alpha subunit, which binds GTP and exhibits
GTPase activity, is about 350-400 amino acids in length with a molecular weight in the range of 40-
45 kDa. Seventeen distinct types of alpha subunit have been identified in mammals, and fall into 4
main groups on the basis of both sequence similarity and function: alpha-s, alpha-q, alpha-i and
alpha-12 (Simon et al., Science (1993) 252:802). They are often N-terminally acylated, usually with
myristate and/or palmitoylate, and these fatty acid modifications can be important for membrane
association and high-affinity interactions with other proteins.

Helicases conserved C-terminal domain (helicase_C). SEQ ID NOS:1621 and 1652
represent polynucleotides encoding novel members of the DEAD/H helicase family. The DEAD
and DEAH families are described above.

Helix-Loop-Helix (HLH) DNA Binding Domain (HLH). SEQ ID NO:2192 corresponds to a
sequence encoding an HLH domain. The HLH domain, which normally spans about 40 to 50 amino
acids, is present in a number of eukaryotic transcription factors. The HLH domain is formed of two
amphipathic helices joined by a variable length linker region that forms a loop that mediates
protein dimerization (Murre et al., Cell (1989) 56:777-783). Basic HLH proteins (bHLH), which
have an extra basic region of about 15 amino acid residues adjacent to the HLH domain and
specifically bind to DNA, include two groups: class A (ubiquitous) and class B (tissue-specific).
bHLH family members bind variations of the E-box motif (CANNTG). The homo- or
heterodimerization mediated by the HLH domain is independent of, but necessary for, DNA binding,
as two basic regions are required for DNA binding activity. The HLH proteins lacking the basic
domain function as negative regulators since they form heterodimers, but fail to bind DNA.
Consensus pattern: [DENSTAP]-[KTR]-[LIVMAGSNT]-{FYWCPHKR}-[LIVMT]-[LIVM]-x(2)-
[STAV]-[LIVMSTACKR]-x-[VMFYH]-[LIVMTA]-{P}-{P}-[LIVMRKHQ].

Kinase Domain of Tors. The TOR profile is directed towards a lipid kinase protein family.
This family is composed of large proteins that have a lipid and protein kinase domain and are
characterized by their sensitivity to rapamycin (an antifungal compound). TOR proteins are involved in
signal transduction downstream of PI3 kinase and many other signals. TOR (also called FRAP,
RAFT) plays a role in regulating protein synthesis and cell growth, and in yeast controls translation
initiation and early G1 progression. See, e.g., Barbet et al., Mol. Biol. Cell (1996) 7(1):25-42;
Helliwell et al., Genetics (1998) 148:99-112.

MAP kinase kinase (mkk). SEQ ID NOS:1825, 1876, 2039, and 2526 represent members of
the MAP kinase kinase (mkk) family. MAP kinases (MAPK) are involved in signal transduction,
and are important in cell cycle and cell growth controls. The MAP kinase kinases (MAPKK) are
dual-specificity protein kinases which phosphorylate and activate MAP kinases. MAPKK
homologues have been found in yeast, invertebrates, amphibians, and mammals. Moreover, the
MAPKK/MAPK phosphorylation switch constitutes a basic module activated in distinct pathways in
yeast and in vertebrates. MAPKKs are essential transducers through which signals must pass before
reaching the nucleus. For review, see, e.g., Biologique, Biol. Cell (1993) 79:193-207; Nishida et al.,
Trends Biochem. Sci. (1993) 18:128-31; Ruderman, Curr. Opin. Cell Biol. (1993) 5:207-13;
Dhanasekaran et al., Oncogene (1998) 17:1447-55; Kiefer et al., Biochem. Soc. Trans. (1997) 25:491-
8; and Hill, Cell Signal. (1996) 8:533-44.

Neurotransmitter-Gated Ion-Channel (neur_chan). Several of the sequences
correspond to a sequence encoding a neurotransmitter-gated ion channel. Neurotransmitter-gated
ion-channels, which provide the molecular basis for rapid signal transmission at chemical synapses,
are post-synaptic oligomeric transmembrane complexes that transiently form an ionic channel upon
the binding of a specific neurotransmitter. Five types of neurotransmitter-gated receptors are
known: 1) nicotinic acetylcholine receptor (AchR); 2) glycine receptor; 3) gamma-aminobutyric-acid
(GABA) receptor; 4) serotonin 5HT3 receptor; and 5) glutamate receptor. All known sequences
of subunits from neurotransmitter-gated ion-channels are structurally related, and are composed of a
large extracellular glycosylated N-terminal ligand-binding domain, followed by three hydrophobic
transmembrane regions that form the ionic channel, followed by an intracellular region of variable
length. A fourth hydrophobic region is found at the C-terminal of the sequence. The consensus
pattern is: C-x-[LIVMFQ]-x-[LIVMF]-x(2)-[FY]-P-x-D-x(3)-C, where the two C's are linked by a
disulfide bond.

Protein Kinase (protkinase). Several sequences represent polynucleotides encoding protein
kinases, which catalyze phosphorylation of proteins in a variety of pathways, and are implicated in
cancer. Eukaryotic protein kinases (Hanks et al., FASEB J. (1995) 9:576; Hunter, Meth. Enzymol.
(1991) 200:3; Hanks et al., Meth. Enzymol. (1991) 200:38; Hanks, Curr. Opin. Struct. Biol. (1991)
1:369; Hanks et al., Science (1988) 241:42) belong to a very extensive family of proteins that share a
conserved catalytic core common to both serine/threonine and tyrosine protein kinases. There are a
number of conserved regions in the catalytic domain of protein kinases. The first region, located in
the N-terminal extremity of the catalytic domain, is a glycine-rich stretch of residues in the vicinity
of a lysine residue, which has been shown to be involved in ATP binding. The second region,
located in the central part of the catalytic domain, contains a conserved aspartic acid residue that
is important for the catalytic activity of the enzyme (Knighton et al., Science (1991) 253:407).
The protein kinase profile includes two signature patterns for this second region: one
specific for serine/threonine kinases and the other for tyrosine kinases. A third profile is based on
the alignment in Hanks et al., FASEB J. (1995) 9:576, and covers the entire catalytic domain. The
consensus patterns are as follows: 1) [LIV]-G-{P}-G-{P}-[FYWMGSTNH]-[SGA]-{PW}-
[LIVCAT]-{PD}-x-[GSTACLIVMFY]-x(5,18)-[LIVMFYWCSTAR]-[AIVP]-[LIVMFAGCKR]-K,
where K binds ATP; 2) [LIVMFYC]-x-[HY]-x-D-[LIVMFY]-K-x(2)-N-[LIVMFYCT](3), where D
is an active site residue; and 3) [LIVMFYC]-x-[HY]-x-D-[LIVMFY]-[RSTAC]-x(2)-N-
[LIVMFYC], where D is an active site residue.

Protein Tyrosine Phosphatase (Y_phosphatase) (PTPase). SEQ ID NOS:1719, 1769, 2062,
2197, and 2275 represent polynucleotides encoding a tyrosine-specific protein phosphatase, an enzyme
that catalyzes the removal of a phosphate group attached to a tyrosine residue (EC 3.1.3.48)
(PTPase) (Fischer et al., Science (1991) 253:401; Charbonneau et al., Annu. Rev. Cell Biol. (1992)
8:463; Trowbridge, J. Biol. Chem. (1991) 266:23517; Tonks et al., Trends Biochem. Sci. (1989) 14:497;
and Hunter, Cell (1989) 58:1013). PTPases are important in the control of cell growth, proliferation,
differentiation and transformation. Multiple forms of PTPase have been characterized and can be
classified into two categories: soluble PTPases and transmembrane receptor proteins that contain
PTPase domain(s). Structurally, all known receptor PTPases are made up of a variable length
extracellular domain, followed by a transmembrane region and a C-terminal catalytic cytoplasmic
domain. PTPase domains consist of about 300 amino acids. Two conserved cysteines are absolutely
required for activity, with a number of other conserved residues in the immediate vicinity also
important for activity. The consensus pattern for PTPases is: [LIVMF]-H-C-x(2)-G-x(3)-[STC]-
[STAGP]-x-[LIVMFY]; C is the active site residue.

RNA Recognition Motif (rrm). SEQ ID NOS:1850 and 2194 correspond to a sequence
encoding an RNA recognition motif, also known as an RRM, RBD, or RNP domain. This domain,
which is about 90 amino acids long, is contained in eukaryotic proteins that bind single-stranded
RNA (Bandziulis et al., Genes Dev. (1989) 3:431-437; Dreyfuss et al., Trends Biochem. Sci. (1988)
13:86-91). Two regions within the RNA-binding domain are highly conserved: the first is a
hydrophobic segment of six residues (which is called the RNP-2 motif), the second is an
octapeptide motif (which is called RNP-1 or RNP-CS). The consensus pattern is: [RK]-G-
{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYLM].

SH2 Domain (SH2). SEQ ID NO:2441 corresponds to a sequence encoding an SH2
domain. The Src homology 2 (SH2) domain includes an approximately 100 amino acid residue
domain, which is conserved in the oncoproteins Src and Fps, as well as in many other intracellular
signal-transducing proteins (Sadowski et al., Mol. Cell. Biol. (1986) 6:4396-4408; Russel et al.,
FEBS Lett. (1992) 304:15-20). SH2 domains function as regulatory modules of intracellular
signaling cascades by interacting with high affinity with phosphotyrosine-containing target peptides in
a sequence-specific and strictly phosphorylation-dependent manner. The SH2 domain has a
conserved 3D structure consisting of two alpha helices and six to seven beta-strands. The core of
the domain is formed by a continuous beta-meander composed of two connected beta-sheets
(Kuriyan et al., Curr. Opin. Struct. Biol. (1993) 3:828-837).

Thioredoxin family active site (Thioredox). SEQ ID NO:1618 represents a polynucleotide
encoding a protein of the thioredoxin family. Thioredoxins are small proteins of approximately one
hundred amino acid residues that participate in various redox reactions via the reversible oxidation
of an active center disulfide bond (Holmgren, Annu. Rev. Biochem. (1985) 54:231; Gleason et al.,
FEMS Microbiol. Rev. (1988) 54:271; Holmgren, J. Biol. Chem. (1989) 264:13963; Eklund et al.,
Proteins (1991) 11:13). Thioredoxins exist in either a reduced form or an oxidized form, in which the
two cysteine residues are linked in an intramolecular disulfide bond. The sequence around the redox-
active disulfide bond is well conserved. The consensus pattern is: [LIVMF]-[LIVMSTA]-x-
[LIVMFYC]-[FYWSTHE]-x(2)-[FYWGTN]-C-[GATPLVE]-[PHYWSTA]-C-x(6)-[LIVMFYWT]
(where the two C's form the redox-active bond).

Trypsin (trypsin). SEQ ID NOS:1579, 22%, 2341, 2421, 2430, and 2438 correspond to
novel serine proteases of the trypsin family. The catalytic activity of the serine proteases from the
trypsin family is provided by a charge relay system involving an aspartic acid residue hydrogen-
bonded to a histidine, which itself is hydrogen-bonded to a serine. The sequences in the vicinity of
the active site serine and histidine residues are well conserved (Brenner, Nature (1988) 334:528).
The consensus patterns for the trypsin protein family are: 1) [LIVM]-[ST]-A-[STAG]-H-C, where H
is the active site residue; and 2) [DNSTAGC]-[GSTAPIMVQH]-x(2)-G-[DE]-S-G-[GS]-[SAPHV]-
[LIVMFYWH]-[LIVMFYSTANQH], where S is the active site residue. All sequences known to
belong to this family are detected by the above consensus sequences, except for 18 different
proteases which have lost the first conserved glycine. If a protein includes both the serine and the
histidine active site signatures, the probability of it being a trypsin family serine protease is 100%.

WD Domain, G-Beta Repeats (WD_domain). SEQ ID NO:2281 represents a member of
the WD domain/G-beta repeat family. Beta-transducin (G-beta) is one of the three subunits (alpha,
beta, and gamma) of the guanine nucleotide-binding proteins (G proteins) which act as
intermediaries in the transduction of signals generated by transmembrane receptors (Gilman, Annu.
Rev. Biochem. (1987) 56:615). The alpha subunit binds to and hydrolyzes GTP; the beta and gamma
subunits are required for the replacement of GDP by GTP as well as for membrane anchoring and
receptor recognition. In higher eukaryotes, G-beta exists as a small multigene family of highly
conserved proteins of about 340 amino acid residues. Structurally, G-beta has eight tandem repeats
of about 40 residues, each containing a central Trp-Asp motif (this type of repeat is sometimes
called a WD-40 repeat). The consensus pattern for the WD domain/G-Beta repeat family is:
[LIVMSTAC]-[LIVMFYWSTAGC]-[LIMSTAG]-[LIVMSTAGC]-x(2)-[DN]-x(2)-
[LIVMWSTAC]-x-[LIVMFSTAG]-W-[DEN]-[LIVMFSTAGCN].

wnt Family of Developmental Signaling Proteins (Wnt_dev_sign). Several of the sequences
correspond to novel members of the wnt family of developmental signaling proteins. Wnt-1
(previously known as int-1), the seminal member of this family (Nusse, Trends Genet. (1988)
4:291), plays a role in intercellular communication and is important in central nervous system
development. All wnt family proteins share the following features characteristic of secretory
proteins: a signal peptide, several potential N-glycosylation sites and 22 conserved cysteines that
may be involved in disulfide bonds. Wnt proteins generally adhere to the plasma membrane of
secreting cells and are therefore likely to signal over only a few cell diameters. The consensus
pattern, which is based upon a highly conserved region including three cysteines, is as follows: C-K-
C-H-G-[LIVMT]-S-G-x-C.

Zinc Finger, C2H2 Type (Zincfing_C2H2). SEQ ID NOS:1735, 1942, 2018, 2254, and
2515 correspond to polynucleotides encoding members of the C2H2 type zinc finger protein family,
which contain zinc finger domains that facilitate nucleic acid binding (Klug et al., Trends Biochem.
Sci. (1987) 12:464; Evans et al., Cell (1988) 52:1; Payre et al., FEBS Lett. (1988) 234:245; Miller et
al., EMBO J. (1985) 4:1609; and Berg, Proc. Natl. Acad. Sci. USA (1988) 85:99). In addition to the
conserved zinc ligand residues, a number of other positions are also important for the structural
integrity of the C2H2 zinc fingers (Rosenfeld et al., J. Biomol. Struct. Dyn. (1993) 11:557). The
best conserved position, which is generally an aromatic or aliphatic residue, is located four residues
after the second cysteine. The consensus pattern for C2H2 zinc fingers is: C-x(2,4)-C-x(3)-
[LIVMFYWC]-x(8)-H-x(3,5)-H. The two C's and two H's are zinc ligands.
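
The consensus patterns listed above can be applied to a translated sequence mechanically. The following is a minimal sketch, assuming the patterns follow the PROSITE-style notation they appear to use (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeat counts). The function name prosite_to_regex and the test peptide are hypothetical and are not part of the disclosure.

```python
import re

# One pattern element: x, a single residue, [allowed] or {excluded},
# optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style consensus pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element.strip())
        if match is None:
            raise ValueError(f"unrecognized pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


if __name__ == "__main__":
    c2h2 = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
    # Hypothetical peptide used only to exercise the pattern.
    peptide = "MKCPECGKSFSQKSNLQKHQRTHTG"
    hit = re.search(c2h2, peptide)
    print(c2h2, hit.span() if hit else None)
```

A regex match only indicates that a sequence is consistent with the signature; profile-based searches as described elsewhere in the specification would still be needed to assign family membership.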

Example 4: Differential Expression of Polynucleotides of the Invention: Description of
Libraries and Detection of Differential Expression

The relative expression levels of the polynucleotides of the invention were assessed in
several libraries prepared from various sources, including cell lines and patient tissue samples.
Table 4 provides a summary of these libraries, including the shortened library name (used hereafter),
the mRNA source used to prepare the cDNA library, the "nickname" of the library that is used in
the tables below (in quotes), and the approximate number of clones in the library.

Table 4. Description of cDNA Libraries

Library        Description                                                              Number of
(lib #)                                                                                 Clones in
                                                                                        Cluster

1              Km12L4: Human Colon Cell Line, High Metastatic Potential                 307133
               (derived from Km12C); "High Met Colon"
2              Km12C: Human Colon Cell Line, Low Metastatic Potential;                  284755
               "Low Met Colon"
3              MDA-MB-231: Human Breast Cancer Cell Line, High Metastatic               326937
               Potential; micro-metastases in lung; "High Met Breast"
4              MCF7: Human Breast Cancer Cell, Non Metastatic;                          318979
               "Low Met Breast"
8              MV-522: Human Lung Cancer Cell Line, High Metastatic                     223620
               Potential; "High Met Lung"
[illegible]    UCP-3: Human Lung Cancer Cell Line, Low Metastatic                       312503
               Potential; "Low Met Lung"
12             Human microvascular endothelial cells (HMEC) - Untreated;                41938
               PCR (OligodT) cDNA library; "HMEC"
13             Human microvascular endothelial cells (HMEC) - Basic fibroblast          42100
               growth factor (bFGF) treated; PCR (OligodT) cDNA library;
               "HMEC-bFGF"
14             Human microvascular endothelial cells (HMEC) - Vascular                  42825
               endothelial growth factor (VEGF) treated; PCR (OligodT) cDNA
               library; "HMEC-VEGF"
15             Normal Colon - UC#2 Patient; PCR (OligodT) cDNA library;                 282722
               "Normal Colon Tissue"
16             Colon Tumor - UC#2 Patient; PCR (OligodT) cDNA library;                  298831
               "Normal Colon Tumor Tissue"
17             Liver Metastasis from Colon Tumor of UC#2 Patient;                       303467
               PCR (OligodT) cDNA library; "High Met Colon Tissue"
18             Normal Colon - UC#3 Patient; PCR (OligodT) cDNA library;                 36216
               "Normal Colon Tissue"
19             Colon Tumor - UC#3 Patient; PCR (OligodT) cDNA library;                  [illegible]
               "Colon Tumor Tissue"
20             Liver Metastasis from Colon Tumor of UC#3 Patient;                       30956
               PCR (OligodT) cDNA library; "High Met Colon Tissue"
21             GRRpz: Human Prostate Cell Line; "Normal Prostate"                       164801
22             WOca: Human Prostate Cancer Cell Line; "Prostate Cancer"                 162088

The KM12L4, KM12C and MDA-MB-231 cell lines are described in Example 1 above. The
MCF7 cell line was derived from a pleural effusion of a breast adenocarcinoma and is non-
metastatic. The MV-522 cell line is derived from a human lung carcinoma and is of high metastatic
potential. The UCP-3 cell line is a low metastatic human lung carcinoma cell line; the MV-522 is a
high metastatic variant of UCP-3. These cell lines are well-recognized in the art as models for the
study of human breast and lung cancer (see, e.g., Chandrasekaran et al., Cancer Res. (1979) 39:870
(MDA-MB-231 and MCF-7); Gastpar et al., J. Med. Chem. (1998) 41:4965 (MDA-MB-231 and
MCF-7); Ranson et al., Br. J. Cancer (1998) 77:1586 (MDA-MB-231 and MCF-7); Kuang et al.,
Nucleic Acids Res. (1998) 26:1116 (MDA-MB-231 and MCF-7); Varki et al., Int. J. Cancer (1987)
40:46 (UCP-3); Varki et al., Tumour Biol. (1990) 11:327 (MV-522 and UCP-3); Varki et al.,
Anticancer Res. (1990) 10:637 (MV-522); Kelner et al., Anticancer Res. (1995) 15:867 (MV-522);
and Zhang et al., Anticancer Drugs (1997) 8:696 (MV-522)). The samples of libraries 15-20 are
derived from two different patients (UC#2 and UC#3). The bFGF-treated HMEC were prepared by
incubation with bFGF at 10 ng/ml for 2 hrs; the VEGF-treated HMEC were prepared by incubation
with 20 ng/ml VEGF for 2 hrs. Following incubation with the respective growth factor, the cells
were washed and lysis buffer was added for RNA preparation. The GRRpz and WOca cell lines were
provided by Dr. Donna M. Peehl, Department of Medicine, Stanford University School of Medicine.
GRRpz was derived from normal prostate epithelium. The WOca cell line is a Gleason Grade 4 cell
line.

Each of the libraries is composed of a collection of cDNA clones that in turn are
representative of the mRNAs expressed in the indicated mRNA source. In order to facilitate the
analysis of the millions of sequences in each library, the sequences were assigned to clusters. The
concept of "cluster of clones" is derived from a sorting/grouping of cDNA clones based on their
hybridization pattern to a panel of roughly 300 7-bp oligonucleotide probes (see Drmanac et al.,
Genomics (1996) 37(1):29). Random cDNA clones from a tissue library are hybridized at moderate
stringency to 300 7-bp oligonucleotides. Each oligonucleotide has some measure of specific
hybridization to that specific clone. The combination of the 300 measures of hybridization for the
300 probes constitutes the "hybridization signature" for a specific clone. Clones with similar sequence
will have similar hybridization signatures. By developing a sorting/grouping algorithm to analyze
these signatures, groups of clones in a library can be identified and brought together
computationally. These groups of clones are termed "clusters". Depending on the stringency of the
selection in the algorithm (similar to the stringency of hybridization in a classic library cDNA
screening protocol), the "purity" of each cluster can be controlled. For example, artifacts of
clustering may occur in computational clustering just as artifacts can occur in "wet-lab" screening of
a cDNA library with 400 bp cDNA fragments, at even the highest stringency. The stringency used
in the implementation of clustering herein provides groups of clones that are in general from the same
cDNA or closely related cDNAs. Closely related clones can be a result of different length clones of
the same cDNA, closely related clones from highly related gene families, or splice variants of the
same cDNA.
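
The grouping of clones by hybridization signature described above can be illustrated with a small sketch. The specification does not disclose the actual sorting/grouping algorithm, so the greedy single-pass scheme below, the use of Pearson correlation as the similarity measure, the 0.9 threshold, and the names pearson and cluster_signatures are all assumptions made for illustration. The threshold plays the role of the "stringency" that controls cluster purity.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length hybridization signatures."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat signature carries no pattern to compare
    return cov / sqrt(var_a * var_b)


def cluster_signatures(signatures, threshold=0.9):
    """Greedily assign each clone to the first cluster whose representative
    signature correlates with it at or above the threshold (the stringency);
    otherwise the clone starts a new cluster."""
    clusters = []  # list of (representative signature, [clone ids])
    for clone_id, signature in signatures.items():
        for representative, members in clusters:
            if pearson(representative, signature) >= threshold:
                members.append(clone_id)
                break
        else:
            clusters.append((signature, [clone_id]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    # Toy signatures over 4 probes; real signatures would have about 300 values.
    toy = {
        "cloneA": [1.0, 0.0, 1.0, 0.0],
        "cloneB": [0.9, 0.1, 1.0, 0.0],
        "cloneC": [0.0, 1.0, 0.0, 1.0],
    }
    print(cluster_signatures(toy))
```

Raising the threshold yields smaller, purer clusters; lowering it merges more distantly related clones, which mirrors the stringency trade-off described in the text.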

Differential expression for a selected cluster was assessed by first determining the number
of cDNA clones corresponding to the selected cluster in the first library (Clones in 1st), and then
determining the number of cDNA clones corresponding to the selected cluster in the second library
(Clones in 2nd). Differential expression of the selected cluster in the first library relative to the
second library is expressed as a "ratio" of percent expression between the two libraries. In general,
the "ratio" is calculated by: 1) calculating the percent expression of the selected cluster in the first
library by dividing the number of clones corresponding to a selected cluster in the first library by the
total number of clones analyzed from the first library; 2) calculating the percent expression of the
selected cluster in the second library by dividing the number of clones corresponding to a selected
cluster in a second library by the total number of clones analyzed from the second library; and 3)
dividing the calculated percent expression from the first library by the calculated percent expression
from the second library. If the "number of clones" corresponding to a selected cluster in a library is
zero, the value is set at 1 to aid in calculation. The formula used in calculating the ratio takes into
account the "depth" of each of the libraries being compared, i.e., the total number of clones analyzed
in each library.
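
The three-step ratio calculation above can be sketched as follows. This is an illustrative sketch only: the function name expression_ratio is hypothetical, and the use of the Table 4 clone counts for libraries 3 and 4 as the "total number of clones analyzed" is an assumption.

```python
def expression_ratio(clones_first, total_first, clones_second, total_second):
    """Ratio of percent expression of a cluster in the first library
    relative to the second library."""
    # A clone count of zero is set to 1 to aid in calculation.
    clones_first = max(clones_first, 1)
    clones_second = max(clones_second, 1)
    percent_first = clones_first / total_first      # step 1
    percent_second = clones_second / total_second   # step 2
    return percent_first / percent_second           # step 3


if __name__ == "__main__":
    LIB3_TOTAL = 326937  # assumed total for library 3 (MDA-MB-231), from Table 4
    LIB4_TOTAL = 318979  # assumed total for library 4 (MCF7), from Table 4
    # 40 clones in lib3 versus 0 clones in lib4
    print(round(expression_ratio(40, LIB3_TOTAL, 0, LIB4_TOTAL)))
    # 60 clones in lib3 versus 3 clones in lib4
    print(round(expression_ratio(60, LIB3_TOTAL, 3, LIB4_TOTAL)))
```

Under these assumptions the two calls give 39 and 20 after rounding, which agrees with the first two rows of Table 5 and suggests that the tabulated ratios are rounded to the nearest integer.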

In general, a polynucleotide is said to be significantly differentially expressed between two
samples when the ratio value is greater than at least about 2, preferably greater than at least about 3,
more preferably greater than at least about 5, where the ratio value is calculated using the method
described above. The significance of differential expression is determined using a z score test (Zar,
Biostatistical Analysis, Prentice Hall, Inc., USA, "Differences between Proportions," pp. 296-298
(1974)).
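
A z score for the difference between two proportions can be sketched as below. The specification cites Zar but does not spell out the formula, so the standard pooled two-proportion z statistic used here, without continuity correction, is an assumption, as are the function name two_proportion_z and the use of the Table 4 library totals.

```python
from math import sqrt


def two_proportion_z(clones_first, total_first, clones_second, total_second):
    """Pooled two-proportion z statistic for a cluster's clone counts
    in two libraries (no continuity correction)."""
    p_first = clones_first / total_first
    p_second = clones_second / total_second
    pooled = (clones_first + clones_second) / (total_first + total_second)
    std_err = sqrt(pooled * (1 - pooled) * (1 / total_first + 1 / total_second))
    if std_err == 0:
        return 0.0  # no clones (or all clones) in both libraries
    return (p_first - p_second) / std_err


if __name__ == "__main__":
    # 40 clones in library 3 versus 0 in library 4 (totals assumed from Table 4)
    z = two_proportion_z(40, 326937, 0, 318979)
    print(f"z = {z:.2f}")
```

For a two-sided test at the 5% level, an absolute z value above about 1.96 would conventionally be taken as significant; the threshold actually applied in the specification is not stated.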

Examples 5-12: Differential Expression of Polynucleotides of the Invention

A number of polynucleotide sequences have been identified that are differentially expressed
between, for example, cells derived from high metastatic potential cancer tissue and low metastatic
cancer cells, and between cells derived from high metastatic potential cancer tissue and normal
tissue. Evaluation of the levels of expression of the genes corresponding to these sequences can be
valuable in diagnosis, prognosis, and/or treatment (e.g., to facilitate rational design of therapy,
monitoring during and after therapy, etc.). Moreover, the genes corresponding to differentially
expressed sequences described herein can be therapeutic targets due to their involvement in
regulation (e.g., inhibition or promotion) of development of, for example, the metastatic phenotype.
For example, sequences that correspond to genes that are increased in expression in high metastatic
potential cells relative to normal or non-metastatic tumor cells may encode genes or regulatory
sequences involved in processes such as angiogenesis, differentiation, cell replication, and
metastasis.

Detection of the relative expression levels of differentially expressed polynucleotides
described herein can provide valuable information to guide the clinician in the choice of therapy.
For example, a patient sample exhibiting an expression level of one or more of these polynucleotides
that corresponds to a gene that is increased in expression in metastatic or high metastatic potential
cells may warrant more aggressive treatment for the patient. In contrast, detection of expression
levels of a polynucleotide sequence that correspond to expression levels associated with low
metastatic potential cells may warrant a more positive prognosis than the gross pathology would
suggest.

A number of polynucleotide sequences of the present invention are differentially expressed
between human microvascular endothelial cells (HMEC) that have been treated with growth factors
and untreated HMEC. Sequences that are differentially expressed between growth factor-
treated HMEC and untreated HMEC can represent sequences encoding gene products involved in
angiogenesis, metastasis (cell migration), and other developmental and oncogenic processes. For
example, sequences that are more highly expressed in HMEC treated with growth factors (such as
bFGF or VEGF) relative to untreated HMEC can serve as markers of cancer cells of higher
metastatic potential. Detection of expression of these sequences in colon cancer tissue can be
valuable in determining diagnostic, prognostic and/or treatment information associated with
preventing these tissues from reaching the malignant state, and can be important in risk assessment
for a patient. A patient sample displaying an increased level of one or more of these polynucleotides
may thus warrant closer attention or more frequent screening procedures to catch the malignant state
as early as possible.

The differential expression of the polynucleotides described herein can thus be used, for
example, as diagnostic markers or prognostic markers, or for risk assessment, patient treatment and
the like. These polynucleotide sequences can also be used in combination with other known molecular
and/or biochemical markers. The following examples provide relative expression levels of
polynucleotides from specified cell lines and patient tissue samples.

Example 5: High Metastatic Potential Breast Cancer Versus Low Metastatic Breast Cancer Cells

The following tables summarize polynucleotides that represent genes that are differentially
expressed between high metastatic potential and low metastatic potential breast cancer cells.
Table 5. High metastatic potential breast (lib3) > low metastatic potential (lib4) breast cancer cells

SEQ ID NO:       Lib3 Clones      Lib4 Clones      Lib3/Lib4

1213             40               0                39
1538             60               3                20
1466             14               0                14
1356             10               0                10
1383             10               1                10
1158             10               1                10
441              10               1                10
1338             10               0                10
1426             19               2                9
1547             9                1                9
1313             8                1                8
841              8                1                8
1534             8                [illegible]      8
1503             8                [illegible]      8
829              8                1                8
1408             8                0                8
1447             7                0                7
1389             7                [illegible]      7
356              7                0                7
14[illegible]2   7                0                7
1543             [illegible]      3                7
79Q              7                0                7
1437             6                0                6
1251             6                0                6
972              18               3                6
1482             6                0                6
12[illegible]    6                0                6
109              24               4                6
1558             6                0                6
1355             6                0                6
1548             11               2                5
250              10               2                5
919              26               6                4
358              36               12               3
1525             75               28               3
1157             4[illegible]     17               3



Table 6. Low metastatic potential (lib4) > high metastatic potential (lib3) breast cancer cells

SEQ ID NO:   Lib3 Clones   Lib4 Clones   Lib4/Lib3
248          0             58            59
726          1             23            24
14           1             19            [illegible]
699          0             14            14
763          1             14            14
20           [illegible]   13            13
79           1             13            13
715          0             10            10
991          0             8             8
1199         0             8             8
707          [illegible]   7             7
1128         4             26            7
891          0             6             6
1146         2             11            6
731          [illegible]   44            6
1518         3             15            5
340          3             13            4
949          4             13            3
1247         7             18            3
1185         497           1216          3

Example 6: High Metastatic Potential Lung Cancer Versus Low Metastatic Lung Cancer Cells

The following summarizes polynucleotides that represent genes differentially expressed
between high metastatic potential lung cancer cells and low metastatic potential lung cancer cells:

Table 7. High metastatic potential lung (lib8) > low metastatic potential lung (lib9) cancer cells

SEQ ID NO:   Lib8 Clones   Lib9 Clones   Lib8/Lib9
150          31            0             43
651          43            2             30
1298         14            1             20
57           11            0             15
625          7             0             10
1322         [illegible]   1             10
36           1             [illegible]   10
621          18            3             8
215          6             1             8
561          19            4             7
247          5             0             7
199          5             0             7
998          5             0             7
502          5             0             7
1382         8             2             6
1181         17            4             6
1309         8             2             6
1157         15            4             5
1260         14            5             4
1185         710           1266          4
1525         21            10            3



Table 8. Low metastatic potential lung (lib9) > high metastatic potential lung (lib8) cancer cells

SEQ ID NO:   Lib8 Clones   Lib9 Clones   Lib9/Lib8
924          1             13            9
822          1             13            9
728          1             12            9
341          1             12            9
1527         3             31            [illegible]
698          4             26            5
949          [illegible]   15            5
744          [illegible]   23            5
973          8             27            2
Example 7: High Metastatic Potential Colon Cancer Versus Low Metastatic Colon Cancer Cells

Tables 9 and 10 summarize polynucleotides that represent genes differentially expressed
between high metastatic potential and low metastatic potential colon cancer cells:

Table 9. High metastatic potential (lib1) > low metastatic potential (lib2) colon cancer cells

SEQ ID NO:   Lib1 Clones   Lib2 Clones   Lib1/Lib2
248          67            [illegible]   31
87           12            0             11
698          11            0             10
57           13            [illegible]   4
924          24            10            2
1249         24            9             2



Table 10. Low metastatic potential (lib2) > high metastatic potential (lib1) colon cancer cells

SEQ ID NO:   Lib1 Clones   Lib2 Clones   Lib2/Lib1
1268         1             17            18
1114         0             15            16
1032         1             14            15
109          5             60            13
973          1             11            12
91           1             11            12
982          0             9             10
1267         3             28            10
93           1             8             9
1556         1             8             9
1251         0             8             9
1206         2             17            9
812          0             8             9
1254         0             7             8
1220         0             7             8
766          0             7             8
1156         0             7             8
1007         0             7             8
981          0             7             8
762          0             7             8
876          0             6             6
1234         2             11            6
1183         0             6             6
1044         2             12            6
785          [illegible]   [illegible]   6
1069         [illegible]   17            6
770          0             6             6
778          0             [illegible]   6
792          0             6             6
822          2             10            5
1258         7             23            4
1224         7             17            3
984          8             19            [illegible]
841          10            28            3
339          14            34            3
1213         [illegible]   29            3
1201         5             14            3
1192         22            48            2



Example 8: High Metastatic Potential Colon Cancer Patient Tissue Vs. Normal Patient Tissue

Table 11 summarizes polynucleotides that represent genes differentially expressed between
high metastatic potential colon cancer cells and normal colon cells of patient tissue:

Table 11. High metastatic potential colon tissue (lib17) vs. normal colon tissue (lib15)

SEQ ID NO:   Lib15 Clones   Lib17 Clones   Lib17/Lib15
1422         1              13             12
1132         1              10             9
730          1              9              8
1311         0              7              7
78           9              48             5
822          5              20             4

SEQ ID NO:   Lib15 Clones   Lib17 Clones   Lib15/Lib17
463          8              1              9



Example 9: High Tumor Potential Colon Tissue Vs. Metastasized Colon Cancer Tissue
The following table summarizes polynucleotides that represent genes differentially
expressed between high tumor potential colon cancer cells and cells derived from high metastatic
potential colon cancer cells of a patient.

Table 12. High tumor potential colon tissue (lib16) vs. high metastatic colon tissue (lib17)

SEQ ID NO:   Lib16 Clones   Lib17 Clones   Lib16/Lib17
1185         14             4              4

SEQ ID NO:   Lib16 Clones   Lib17 Clones   Lib17/Lib16
822          2              20             10



Example 10: High Tumor Potential Colon Cancer Patient Tissue Versus Normal Patient Tissue
Tables 13 and 14 summarize polynucleotides that represent genes differentially expressed
between high tumor potential colon cancer cells and normal colon cells in patient tissue:

Table 13. Higher expression in tumor potential colon tissue (lib16) vs. normal colon tissue (lib15)

SEQ ID NO:   Lib15 Clones   Lib16 Clones   Lib16/Lib15
1311         0              8              8
78           9              28             3

Table 14. Higher expression in normal colon tissue (lib15) vs. tumor potential colon tissue (lib16)

SEQ ID NO:   Lib15 Clones   Lib16 Clones   Lib15/Lib16
463          8              0              8
1099         [illegible]    [illegible]    4



Example 11: Growth Factor-Stimulated Human Microvascular Endothelial Cells (HMEC)
Relative to Untreated HMEC

The following tables summarize polynucleotides that represent genes differentially
expressed between growth factor-treated and untreated HMEC.

Table 15. Higher expression in bFGF treated HMEC (lib13) vs. untreated HMEC (lib12)

SEQ ID NO:   Lib12 Clones   Lib13 Clones   Lib13/Lib12
1520         9              23             3
1538         17             35             [illegible]

Table 16. Higher expression in VEGF treated HMEC (lib14) vs. untreated HMEC (lib12)

SEQ ID NO:   Lib12 Clones   Lib14 Clones   Lib14/Lib12
1154         2              12             6
1226         [illegible]    10             5
1538         17             38             2



Example 12: Polynucleotides Differentially Expressed in Human Prostate Cancer Cells Relative
to Normal Human Prostate Cells

The following tables summarize identified polynucleotides that represent genes
differentially expressed between prostate cancer cells and normal prostate cells:

Table 17. Higher expression in normal prostate cells (lib21) relative to prostate cancer cells (lib22)

SEQ ID NO:   Lib21 Clones   Lib22 Clones   Lib21/Lib22
1525         6              0              6
248          116            51             2
1203         22             9              2

Table 18. Higher expression in prostate cancer cells (lib22) relative to normal prostate cells (lib21)

SEQ ID NO:   Lib21 Clones   Lib22 Clones   Lib22/Lib21
1213         0              34             35
340          [illegible]    [illegible]    12
699          0              11             11



Example 13: Differential Expression Across Multiple Libraries

A number of polynucleotide sequences have been identified that represent genes that are
differentially expressed across multiple libraries. Expression of these sequences in a tissue of any
origin can be valuable in determining diagnostic, prognostic and/or treatment information associated
with the prevention of achieving the malignant state in these tissues, and can be important in risk
assessment for a patient. These polynucleotides can also serve as non-tissue specific markers of, for
example, risk of metastasis of a tumor. Table 19 summarizes these data.
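
One way to assemble a cross-library summary of this kind is to pool the per-comparison results and keep only sequences that appear in more than one comparison. The sketch below illustrates that grouping step with four entries copied from Tables 5, 10 and 18. The data layout, the function name, and the threshold of two comparisons are assumptions for illustration, not a description of how the summary was actually compiled.

```python
from collections import defaultdict

# (SEQ ID NO, comparison, ratio) triples copied from Tables 5, 10 and 18.
observations = [
    (1213, "High Met Breast (lib3) > Low Met Breast (lib4)", 39),
    (1213, "Prostate Cancer (lib22) > Normal Prostate (lib21)", 35),
    (1213, "Low Met Colon (lib2) > High Met Colon (lib1)", 3),
    (1466, "High Met Breast (lib3) > Low Met Breast (lib4)", 14),
]


def multi_library_sequences(rows, min_comparisons=2):
    """Group rows by SEQ ID NO and keep IDs seen in several comparisons."""
    by_seq = defaultdict(list)
    for seq_id, comparison, ratio in rows:
        by_seq[seq_id].append((comparison, ratio))
    return {
        seq_id: comparisons
        for seq_id, comparisons in by_seq.items()
        if len(comparisons) >= min_comparisons
    }


for seq_id, comparisons in multi_library_sequences(observations).items():
    for comparison, ratio in comparisons:
        print(seq_id, comparison, ratio)
```

With these inputs only SEQ ID NO: 1213 is retained, because SEQ ID NO: 1466 appears in a single comparison.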



Table 19. Genes Differentially Expressed Across Multiple Library Comparisons

SEQ ID NO:   Cell or Tissue Sample and Cancer State Compared                       Ratio
57           High Met Lung (lib8) > Low Met Lung (lib9)                            15
57           High Met Colon (lib1) > Low Met Colon (lib2)                          4
78           High Met Colon Tissue (lib17) > Normal Colon Tissue (lib15)           5
78           Normal Colon Tumor Tissue (lib16) > Normal Colon Tissue (lib15)       3
109          High Met Breast (lib3) > Low Met Breast (lib4)                        6
109          Low Met Colon (lib2) > High Met Colon (lib1)                          13
248          High Met Colon (lib1) > Low Met Colon (lib2)                          31
248          Normal Prostate (lib21) > Prostate Cancer (lib22)                     2
248          Low Met Breast (lib4) > High Met Breast (lib3)                        59
340          Prostate Cancer (lib22) > Normal Prostate (lib21)                     12
340          Low Met Breast (lib4) > High Met Breast (lib3)                        4
463          Normal Colon Tissue (lib15) > High Met Colon Tissue (lib17)           9
463          Normal Colon Tissue (lib15) > Normal Colon Tumor Tissue (lib16)       8
698          High Met Colon (lib1) > Low Met Colon (lib2)                          10
698          Low Met Lung (lib9) > High Met Lung (lib8)                            5
699          Low Met Breast (lib4) > High Met Breast (lib3)                        14
699          Prostate Cancer (lib22) > Normal Prostate (lib21)                     11
822          High Met Colon Tissue (lib17) > Normal Colon Tumor Tissue (lib16)     10
822          Low Met Lung (lib9) > High Met Lung (lib8)                            9
822          Low Met Colon (lib2) > High Met Colon (lib1)                          5
822          High Met Colon Tissue (lib17) > Normal Colon Tissue (lib15)           4
841          High Met Breast (lib3) > Low Met Breast (lib4)                        8
841          Low Met Colon (lib2) > High Met Colon (lib1)                          3
924          High Met Colon (lib1) > Low Met Colon (lib2)                          2
924          Low Met Lung (lib9) > High Met Lung (lib8)                            9
949          Low Met Lung (lib9) > High Met Lung (lib8)                            5
949          Low Met Breast (lib4) > High Met Breast (lib3)                        3
973          Low Met Colon (lib2) > High Met Colon (lib1)                          12
973          Low Met Lung (lib9) > High Met Lung (lib8)                            2
1157         High Met Lung (lib8) > Low Met Lung (lib9)                            5
1157         High Met Breast (lib3) > Low Met Breast (lib4)                        3
1185         Normal Colon Tumor Tissue (lib16) > High Met Colon Tissue (lib17)     4
1185         High Met Lung (lib8) > Low Met Lung (lib9)                            4
1185         Low Met Breast (lib4) > High Met Breast (lib3)                        3
1213         High Met Breast (lib3) > Low Met Breast (lib4)                        39
1213         Prostate Cancer (lib22) > Normal Prostate (lib21)                     35
1213         Low Met Colon (lib2) > High Met Colon (lib1)                          3
1251         High Met Breast (lib3) > Low Met Breast (lib4)                        6
1251         Low Met Colon (lib2) > High Met Colon (lib1)                          9
1311         Normal Colon Tumor Tissue (lib16) > Normal Colon Tissue (lib15)       8
1311         High Met Colon Tissue (lib17) > Normal Colon Tissue (lib15)           7
1525         Normal Prostate (lib21) > Prostate Cancer (lib22)                     6
1525         High Met Lung (lib8) > Low Met Lung (lib9)                            3
1525         High Met Breast (lib3) > Low Met Breast (lib4)                        3
1538         High Met Breast (lib3) > Low Met Breast (lib4)                        20
1538         HMEC-VEGF (lib14) > HMEC (lib12)                                      2
1538         HMEC-bFGF (lib13) > HMEC (lib12)                                      [illegible]

Key for Table 19: High Met = high metastatic potential; Low Met = low metastatic potential; met =
metastasized; tumor = non-metastasized tumor; HMEC = human microvascular endothelial cell;
bFGF = bFGF treated; VEGF = VEGF treated.



Example 14: Identification of Contiguous Sequences Having a Polynucleotide of the Invention
The novel polynucleotides were used to screen publicly available and proprietary databases
to determine if any of the polynucleotides of SEQ ID NOS:2611-2707 would facilitate identification
of a contiguous sequence, e.g., the polynucleotides would provide sequence that would result in 5'
extension of another DNA sequence, resulting in production of a longer contiguous sequence
composed of the provided polynucleotide and the other DNA sequence(s). Contiging was performed
using the Gelmerge application (default settings) of GCG from the Univ. of Wisconsin.
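
The idea of a 5' extension can be illustrated with a deliberately simplified overlap merge: if the 3' end of one sequence exactly matches the 5' start of another, the two can be joined into a longer contiguous sequence. The sketch below is not the Gelmerge algorithm and does not reproduce its settings. The function name, the exact-match requirement, and the minimum overlap are assumptions chosen to keep the example short.

```python
def extend_five_prime(query, target, min_overlap=20):
    """Join query to the 5' end of target when they overlap exactly.

    Looks for the longest suffix of `query` that equals a prefix of `target`
    and returns the merged sequence, or None when no overlap of at least
    `min_overlap` bases exists. Real assemblers tolerate mismatches and gaps;
    this sketch does not.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    query, target = query.lower(), target.lower()
    longest = min(len(query), len(target))
    for k in range(longest, min_overlap - 1, -1):
        if query[-k:] == target[:k]:
            return query + target[k:]
    return None


# Toy sequences with a six-base overlap ("gggttt").
print(extend_five_prime("aaacccgggttt", "gggtttacgt", min_overlap=4))
```

The merged result keeps every base of the query and appends only the part of the target that lies beyond the overlap, so the target is extended at its 5' end.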

Using these parameters, 97 contiged sequences were generated. These contiged sequences
are provided as SEQ ID NOS:2611-2707 (see Table 1C). Table 1C provides the SEQ ID NO of the
contig sequence, the name of the sequence used to create the contig, and the accession number of the
publicly available tentative human consensus (THC) sequence used with the sequence of the
corresponding sequence name to provide the contig. The sequence name of Table 1C can be
correlated with the SEQ ID NO: of the polynucleotide of the invention using Tables 1A and 1B.

The contiged sequences (SEQ ID NOS:2611-2707) thus represent longer sequences that
encompass a polynucleotide sequence of the invention. The contiged sequences were then translated
in all three reading frames to determine the best alignment with individual sequences using the
BLAST programs as described above. The sequences were masked using the XBLAST program for
masking low complexity as described above in Example 1. Several of the contiged sequences were
found to encode polypeptides having characteristics of a polypeptide belonging to known protein
families (and thus represent new members of these protein families) and/or comprising a known
functional domain (Table 5B, inserted prior to claims). Thus the invention encompasses fragments,
fusions, and variants of such polynucleotides that retain biological activity associated with the
protein family and/or functional domain identified herein.
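
Translation in three reading frames can be sketched as follows: the nucleotide sequence is read in codons starting at offsets 0, 1 and 2, and each codon is looked up in the standard genetic code. The function name and the use of "X" for codons containing an ambiguous base (n) are assumptions for illustration; this is not the translation step of the BLAST programs themselves.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate_three_frames(dna):
    """Translate a DNA string in the three forward reading frames.

    Codons that contain anything other than a, c, g or t (for example the
    ambiguity code n) are reported as 'X'. Stop codons are reported as '*'.
    """
    dna = dna.lower()
    frames = []
    for offset in range(3):
        codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
        frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames


for frame, protein in enumerate(translate_three_frames("atggcctaa"), start=1):
    print(frame, protein)
```

Each translated frame can then be compared against protein databases, and the frame giving the best alignment is taken as the likely coding frame.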

Descriptions of the profiles for the indicated protein families and functional domains are
provided in Example 3 above. A description of the profile for PR55 is provided below.

Protein Phosphatase 2A Regulatory Subunit PR55 (PR55). Several of the contigs
correspond to a sequence encoding a protein comprising a protein phosphatase 2A (PP2A)
regulatory subunit PR55. PP2A is a serine/threonine phosphatase involved in many aspects of
cellular function including the regulation of metabolic enzymes and proteins involved in signal
transduction. PP2A is a trimeric enzyme comprising a core composed of a catalytic subunit
associated with a 65 Kd regulatory subunit (PR65, also called subunit A). This complex associates
with a third variable subunit (subunit B), which confers distinct properties to the holoenzyme
(Mayer-Jaekel et al. Trends Cell Biol. (1994) 4:287-291). One of the forms of the variable subunit is
a 55 Kd protein (PR55) which is highly conserved in mammals and may facilitate substrate
recognition or targeting the enzyme complex to the appropriate subcellular compartment. The PR55
subunit comprises two conserved sequences of 15 residues: one located in the N-terminal region, the
other in the center of the protein. The consensus patterns are: E-F-D-Y-L-K-S-L-E-I-E-E-K-I-N;
and N-[AG]-H-[TA]-Y-H-I-N-S-I-S-[LIVM]-N-S-D.
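
Consensus patterns written in this dash-separated notation can be checked against a candidate protein sequence by converting them to regular expressions. The sketch below handles only the two element types used here, single residues and bracketed alternatives. The patterns are given as read from the text above, with the bracketed class [LIVM] in the second pattern being an interpretation of the printed characters. The helper name and the test peptides are hypothetical.

```python
import re


def consensus_to_regex(pattern):
    """Convert a dash-separated consensus pattern into a compiled regex.

    Supports single-letter residues and bracketed alternatives such as [AG].
    Other pattern elements (wildcards, repeats) are not handled in this sketch.
    """
    elements = pattern.rstrip(".;:").split("-")
    for element in elements:
        if not re.fullmatch(r"[A-Z]|\[[A-Z]+\]", element):
            raise ValueError(f"unsupported pattern element: {element!r}")
    return re.compile("".join(elements))


PR55_N_TERMINAL = consensus_to_regex("E-F-D-Y-L-K-S-L-E-I-E-E-K-I-N")
PR55_CENTRAL = consensus_to_regex("N-[AG]-H-[TA]-Y-H-I-N-S-I-S-[LIVM]-N-S-D")

# Hypothetical peptide fragments used only to exercise the patterns.
print(bool(PR55_N_TERMINAL.search("MKEFDYLKSLEIEEKINQ")))
print(bool(PR55_CENTRAL.search("NAHTYHINSISVNSD")))
```

A translated contig that matches both patterns would be a candidate PR55 family member, subject to confirmation by full-length alignment.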

Those skilled in the art will recognize, or be able to ascertain, using not more than routine
experimentation, many equivalents to the specific embodiments of the invention described herein.
Such specific embodiments and equivalents are intended to be encompassed by the following
claims.

All publications and patent applications cited in this specification are herein incorporated by
reference as if each individual publication or patent application were specifically and individually
indicated to be incorporated by reference. The citation of any publication is for its disclosure prior
to the filing date and should not be construed as an admission that the present invention is not
entitled to antedate such publication by virtue of prior invention.

Although the foregoing invention has been described in some detail by way of illustration
and example for purposes of clarity of understanding, it is readily apparent to those of ordinary skill
in the art in light of the teachings of this invention that certain changes and modifications may be
made thereto without departing from the spirit or scope of the appended claims.

Deposit Information. The following materials were deposited with the American Type Culture
Collection (CMCC = Chiron Master Culture Collection).

Table 20. Cell Lines Deposited with ATCC

Cell Line     Deposit Date      ATCC Accession No.   CMCC Accession No.
KM12L4-A      March 19, 1998    CRL-12496            11606
Km12C         May 15, 1998      CRL-12533            11611
MDA-MB-231    May 15, 1998      CRL-12532            10583
MCF-7         October 9, 1998   CRL-12584            10377

In addition, pools of selected clones, as well as libraries containing specific clones, were
assigned an "ES" number (internal reference) and deposited with the ATCC. Table 21 below
provides the ATCC Accession Nos. of the ES deposits, all of which were deposited on or before
May 13, 1999. The names of the clones contained within each of these deposits are provided in the
tables numbered 22 and greater (inserted before the claims).

Table 21: Pools of Clones and Libraries Deposited with ATCC on or before May 14, 1999

ES #   ATCC Accession #   ES #   ATCC Accession #   ES #   ATCC Accession #
34                        41                        48
35                        42                        49
36                        43                        50
37                        44                        51
38                        45                        52
39                        46                        53
40                        47                        54

and is not an admission that a deposit is required under 35 U.S.C. §112. The sequence of the
polynucleotides contained within the deposited material, as well as the amino acid sequence of the
polypeptides encoded thereby, are incorporated herein by reference and are controlling in the event
of any conflict with the written description of sequences herein. A license may be required to make,
use, or sell the deposited material, and no such license is granted hereby.

Retrieval of Individual Clones from Deposit of Pooled Clones. Where the ATCC deposit is
composed of a pool of cDNA clones or a library of cDNA clones, the deposit was prepared by first
transfecting each of the clones into separate bacterial cells. The clones in the pool or library were then
deposited as a pool of equal mixtures in the composite deposit. Particular clones can be obtained from
the composite deposit using methods well known in the art. For example, a bacterial cell containing a
particular clone can be identified by isolating single colonies, and identifying colonies containing the
specific clone through standard colony hybridization techniques, using an oligonucleotide probe or
probes designed to specifically hybridize to a sequence of the clone insert (e.g., a probe based upon
unmasked sequence of the encoded polynucleotide having the indicated SEQ ID NO). The probe should
be designed to have a Tm of approximately 80°C (assuming 2°C for each A or T and 4°C for each G
or C). Positive colonies can then be picked, grown in culture, and the recombinant clone isolated.
Alternatively, probes designed in this manner can be used in PCR to isolate a nucleic acid molecule
from the pooled clones according to methods well known in the art, e.g., by purifying the cDNA from
the deposited culture pool, and using the probes in PCR reactions to produce an amplified product
having the corresponding desired polynucleotide sequence.
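
The Tm rule stated above (2°C for each A or T, 4°C for each G or C) is simple enough to compute directly. The sketch below applies it to a candidate probe and, as one possible design approach, grows a probe from the start of an insert until the estimate reaches the 80°C target. The function names are hypothetical, and the excerpt used is taken from the sequence printed under <400> 1970 purely as sample input.

```python
def estimated_tm(probe):
    """Estimate Tm in degrees C as 2 per A/T plus 4 per G/C."""
    probe = probe.upper()
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    if at + gc != len(probe):
        raise ValueError("probe must contain only A, C, G and T")
    return 2 * at + 4 * gc


def probe_for_target_tm(sequence, target_tm=80):
    """Return the shortest prefix of `sequence` whose estimate reaches target_tm.

    Returns None if the whole sequence stays below the target. Choosing a
    prefix is an illustrative simplification; any unmasked stretch could be used.
    """
    for end in range(1, len(sequence) + 1):
        if estimated_tm(sequence[:end]) >= target_tm:
            return sequence[:end]
    return None


# Excerpt of the sequence printed under <400> 1970, spaces removed.
INSERT_EXCERPT = "tccatttctcagcaacggctcctaggataatcaatcatgg"
probe = probe_for_target_tm(INSERT_EXCERPT)
print(probe, len(probe), estimated_tm(probe))
```

Because the rule rejects ambiguous bases, a probe would in practice be drawn from a region of the insert free of n calls.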
SEQ ID NO.   Sample Name            Overlap   Clone Name
1952         1033.D11.sp6:188359    VO        M00003771D:A10
1953         1033.E11.sp6:188371    VO        M00003773A:C09
1954         1033.F11.sp6:188383    VO        M00003773B:E09
1955         1033.G11.sp6:188395    VO        M00003773B:G08
1956         1033.H11.sp6:188407    VO        M00003773C:G06
1957         1033.A12.sp6:188324    VO        M00003773D:C02
1958         802.E12.sp6:164799     VNO
1959         802.F12.sp6:164811     VNO
1960         802.G12.sp6:164823     VO        M00003784C:B09
1961         802.H12.sp6:164835     VO        M00003785D:E01
1962         803.A1.sp6:164932      VNO
1963         803.B1.sp6:164944      VNO
1964         803.C1.sp6:164956      VNO
1965         1033.B12.sp6:188336    VO        M00003789C:E03
1966         1033.C12.sp6:188348    VO        M00003790B:F12
1967         1033.D12.sp6:188360    VO        M00003793C:D11
1968         1033.F12.sp6:188384    VO        M00003796B:C07
1969         1033.G12.sp6:188396    VO        M00003790C:H03
1970         1034.A01.sp6:188505    VO        M00003797D:H06
1971         1034.B01.sp6:188517    VNO
1972         1034.C01.sp6:188529    VO        M00003801D:F05
1973         1034.D01.sp6:188541    VO        M00003805A:G05
1974         1034.E01.sp6:188553    VO        M00003808C:D09
1975         1034.F01.sp6:188565    VO        M00003800A:A12
1976         1034.G01.sp6:188577    VO        M00003800A:H12
1977         1034.H01.sp6:188589    VO        M00003809B:D08
1978         1034.A02.sp6:188506    VO        M00003811B:E07
1979         1034.B02.sp6:188518    VO        M00003812B:F08
1980         1034.C02.sp6:188530    VO        M00003812D:E08
1981         1034.D02.sp6:188542    VO        M00003813D:A06
1982         1034.E02.sp6:188554    VO        M00003815C:A06
1983         1034.F02.sp6:188566    VNO
1984         1034.G02.sp6:188578    VNO
1985         1034.H02.sp6:188590    VO        M00003818A:F09
1986         1034.A03.sp6:188507    VO        M00003818B:A01
1987         1034.B03.sp6:188519    VO        M00003818C:E0[illegible]
1988         1034.C03.sp6:188531    VNO
1989         1034.D03.sp6:188543    VO        M00003819C:E04
1990         1034.E03.sp6:188555    VO        M00003819D:G09
1991         1034.F03.sp6:188567    VO        M00003820A:H04
1992         1034.H03.sp6:188591    VO        M00003820D:E02
1993         1034.A04.sp6:188508    VO        M00003821C:E04
1994         1034.B04.sp6:188520    VO        M00003822A:G05
We Claim:

1. A library of polynucleotides, the library comprising the sequence information of at least
one of SEQ ID NOS: 1-2707.

2. The library of claim 1, wherein the library is provided on a nucleic acid array.

3. The library of claim 1, wherein the library is provided in a computer-readable format.

4. The library of claim 1, wherein the library comprises a polynucleotide corresponding to
a gene differentially expressed in a cancer cell of high metastatic potential relative to a control cell,
wherein the control cell is a normal cell or a cell of low metastatic potential, and wherein the
sequence is selected from the group consisting of SEQ ID NOS: 1213, 1538, 1466, 1356, 1383,
1158, 441, 1338, 1426, 1547, 1313, 841, 1534, 1503, 829, 1408, 1447, 1389, 356, 1492, 1543, 799,
1437, 1251, 972, 1482, 1299, 109, 1558, 1355, 1548, 250, 919, 358, 1525, 1157, 150, 651, 1298,
57, 625, 1322, 36, 621, 215, 561, 247, 199, 998, 502, 1382, 1181, 1309, 1157, 1260, 1185, 1525,
248, 87, 698, 57, 924, 1249.

5. The library of claim 1, wherein the library comprises a polynucleotide corresponding to
a gene differentially expressed in a cancer cell of low metastatic potential relative to a control cell,
wherein the control cell is a normal cell or a cell of high metastatic potential, and wherein the
sequence is selected from the group consisting of SEQ ID NOS: 248, 726, 14, 699, 763, 20, 79, 715,
991, 1199, 707, 1128, 891, 1146, 731, 1518, 340, 949, 1247, 1185, 924, 822, 728, 341, 1527, 698,
949, 744, 973, 1268, 1114, 1032, 109, 973, 91, 982, 1267, 93, 1556, 1251, 1206, 812, 1254, 1220,
766, 1156, 1007, 981, 762, 876, 1234, 1183, 1044, 785, 1069, 770, 778, 792, 822, 1258, 1224, 984,
841, 339, 1213, 1201, 1192.

6. An isolated polynucleotide comprising a nucleotide sequence having at least 90%
sequence identity to an identifying sequence of SEQ ID NOS: 1-2707 or a degenerate variant or
fragment thereof.

7. A recombinant host cell containing the polynucleotide of claim 6.

8. An isolated polypeptide encoded by the polynucleotide of claim 6.

9. An antibody that specifically binds a polypeptide of claim 8.

10. A vector comprising the polynucleotide of claim 6.

11. A polynucleotide comprising the nucleotide sequence of an insert contained in a clone
deposited as ATCC accession number xx, xx, xx, xx, xx, xx, xx, xx, or xx.

12. A method of detecting differentially expressed genes correlated with a cancerous state
of a mammalian cell, the method comprising the step of:

detecting at least one differentially expressed gene product in a test sample derived from a
cell suspected of being cancerous, where the gene product is encoded by a gene corresponding to a
sequence of at least one of SEQ ID NOS: 1213, 1538, 1466, 1356, 1383, 1158, 441, 1338, 1426,
1547, 1313, 841, 1534, 1503, 829, 1408, 1447, 1389, 356, 1492, 1543, 799, 1437, 1251, 972, 1482,
1299, 109, 1558, 1355, 1548, 250, 919, 358, 1525, 1157, 150, 651, 1298, 57, 625, 1322, 36, 621,
215, 561, 247, 199, 998, 502, 1382, 1181, 1309, 1157, 1260, 1185, 1525, 248, 87, 698, 57, 924,
1249, 248, 726, 14, 699, 763, 20, 79, 715, 991, 1199, 707, 1128, 891, 1146, 731, 1518, 340, 949,
1247, 1185, 924, 822, 728, 341, 1527, 698, 949, 744, 973, 1268, 1114, 1032, 109, 973, 91, 982,
1267, 93, 1556, 1251, 1206, 812, 1254, 1220, 766, 1156, 1007, 981, 762, 876, 1234, 1183, 1044,
785, 1069, 770, 778, 792, 822, 1258, 1224, 984, 841, 339, 1213, 1201, 1192,

wherein detection of the differentially expressed gene product is correlated with a
cancerous state of the cell from which the test sample was derived.
<222> (1)...(618)
<223> n = A,T,C or G



<400> 1970 

agnnnnnnaa tatttgaaaa gagtaattgg tttggaagga gacaaaatcc tcaccactag 60 

tccatcagat ttctttaaaa gccatagtta tactatagtg ataaaaacct gtgctacaca 120 

tccatttctc agcaacggct cctaggataa tcaatcatgg catactgcta atgccttgat 180 

tgcagctgat atggaggaaa tatgtttact cttttgctaa agtgaagttc actgcggagg 240 

tgccaatggg tcatgtttgg ttagaaggtg acaatctaca gaattctaca gattccaggt 300 

gctatggacc tattccatat ggactaataa gaggacgaat cttctttaag atttggcctc 360 

tgagtgattt tggattttta cgtgccagcc ctaatggcca cagattttct gatgattagt 420 

aagcatttat tcttttgact tgattattgn ctccttttca tgtgaattta ttactcccgt 480 

tgaaaccgtg tacttaccaa taaactattt gctnttccna anaaannann nnnnnnnnnn 540 

nnnnnnnaan nnaaaaannn nnnnnnnnnn nnnnnnnggn nnnnnccccc ccccccccct 600

taaaaaangg ggggngtn 618 



<210> 1971 

<211> 796 

<212> DNA 

<213> Homo sapiens 

<220>

<221> misc_feature
<222> (1)...(796)
<223> n = A,T,C or G



<400> 1971 

ntgttcgaat tctgnacnaa gaattcaagn cagcacgtat gtagcagatg atganntcta 60 

anctggatga tacntaatga ngtcagattt gnaatctaac ttngnggctg tgnntagggt 120 

gcaaggagna cttccangac ctatactcna ggcgccctgg gtnnantaan gnaaacnnnc 180 

tncntaaggn tggcccccac gtggggaggt ggagttncng aattattctg tgcgctaccg 240 

gccgggccta gacctgtgct gagagactga gtctgcatgt gcaccggtgg caanaanggg 300

gnngatcgtg gccncacntg gngctgcaag tcttccatga cccttttgct tgttccgcat 360 

cctggaggcg gcaaaagggt gaaatccgca ttgatggcct caatgtggca gacattcggg 420 

cctccattga cctgcgctcc tcanctgacc attcatcccg caggaccccc atccntgttt 480 

ctcgggggga ccccttgccg ccattgaaac cttggaaccc cttttggcag cnttcttcag 540 

aagggaagga acanttttgg gtgggggctt tttgggancn ttnntccccc accctngcca 600 

ccaaccgttt ttgttgaang ccttccccaa accccgggca aaggcccctg gggatncttt 660 

tcccaaaatg gccttcaaaa aaangggccc gggggggaag naaatncttt caaaccgttn 720 

gggggnccca aaaaaggcca ancnttccgt gggtggccct tgggcccccn anaccccttt 780 

gttttcccca aaanaa 796 



<210> 1972 

<211> 681 

<212> DNA 

<213> Homo sapiens 

<220> 

<221> misc_feature
<222> (1)...(681)
<223> n = A,T,C or G



<400> 1972 

ttatcgaata agacacgagg gaggatgttg ncannnncta ntcgggaggc tgacgcagga 60 

gaatcgcttg aacctgggag gcagaggttg cagtgagctg agaccatgcc actgtactcc 120 

agcctgggca atagagcgag attctgtctc ccaaaaaaac aaaaaacaac aacaaaactt 180 